Conversation

@slaren (Member) commented Nov 14, 2025

No description provided.

@slaren slaren requested a review from ggerganov as a code owner November 14, 2025 20:39
@slaren slaren marked this pull request as draft November 14, 2025 20:40
@github-actions github-actions bot added the testing (Everything test related), devops (improvements to build systems and github actions), and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 14, 2025
@slaren slaren marked this pull request as ready for review November 14, 2025 22:04
@ggerganov (Member) commented:

I tried this change after reverting #17143, but it doesn't trigger an error with the llama-batched-bench command described there. I do see it going through this branch:

#ifndef NDEBUG
GGML_LOG_DEBUG("%s: cannot reallocate multi buffer graph automatically, call reserve\n", __func__);
#endif
return false;
}
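
For reference, a condensed sketch of the surrounding graph-allocator logic in ggml-alloc.c (paraphrased, not the verbatim source): automatic reallocation only works when the graph lives in a single buffer; a multi-buffer graph fails and the caller is expected to reserve a worst-case graph itself.

if (ggml_gallocr_needs_realloc(galloc, graph)) {
    if (galloc->n_buffers == 1) {
        // single-buffer graph: can be re-reserved in place automatically
        if (!ggml_gallocr_reserve(galloc, graph)) {
            return false;
        }
    } else {
        // multi-buffer graph: cannot be reallocated automatically; the
        // caller must call ggml_gallocr_reserve() up front with a
        // worst-case graph (this is the GGML_LOG_DEBUG branch quoted above)
        return false;
    }
}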

@slaren (Member, Author) commented Nov 15, 2025

I was trying to reproduce this, but I hit this assert when running llama-batched-bench (with the current version, without reverting #17143):

llama_kv_cache:      Metal KV buffer size =  5310.00 MiB
llama_kv_cache: size = 5310.00 MiB (151040 cells,  36 layers, 16/1 seqs), K (f16): 2655.00 MiB, V (f16): 2655.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   450.51 MiB
llama_context:        CPU compute buffer size =   299.01 MiB
llama_context: graph nodes  = 1231
llama_context: graph splits = 2
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x141a07b00 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1            0x141a08600 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x141a088c0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32_imrope=0'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32_imrope=0                 0x141a08b80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x141a09340 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x141a09600 | th_max = 1024 | th_width =   32
Assertion failed: (ggml_metal_op_flash_attn_ext_extra_pad(op) == 0), function ggml_metal_op_flash_attn_ext, file ggml-metal-ops.cpp, line 2367.
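
For context, a sketch of the invariant that assertion appears to encode. Assumptions, not confirmed by this thread: ggml_metal_op_flash_attn_ext_extra_pad() reports extra scratch padding the flash-attention op would still need at execution time, its signature takes the op tensor, and the padding must have been accounted for during the graph_reserve sizing passes shown in the log above.

// Hypothetical illustration only, not the actual ggml-metal-ops.cpp code.
static void check_fa_pad(const struct ggml_tensor * op) {
    // a non-zero value here would mean the worst-case reservation made
    // during graph sizing was too small for this flash-attention op
    const size_t extra_pad = ggml_metal_op_flash_attn_ext_extra_pad(op);
    GGML_ASSERT(extra_pad == 0);
}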
