Conversation

@slaren (Member) commented Nov 14, 2025

No description provided.

@slaren slaren requested a review from ggerganov as a code owner November 14, 2025 20:39
@slaren slaren marked this pull request as draft November 14, 2025 20:40
@github-actions github-actions bot added the testing (Everything test related), devops (improvements to build systems and github actions), and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 14, 2025
@slaren slaren marked this pull request as ready for review November 14, 2025 22:04
@ggerganov (Member) commented:

I tried this change after reverting #17143, but it doesn't trigger an error with the llama-batched-bench command described there. I do see it going through this branch:

#ifndef NDEBUG
GGML_LOG_DEBUG("%s: cannot reallocate multi buffer graph automatically, call reserve\n", __func__);
#endif
return false;
}
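
For reference, a condensed sketch of the surrounding graph-allocator logic in ggml-alloc.c (paraphrased, not the verbatim source): automatic reallocation only works when the graph lives in a single buffer; a multi-buffer graph fails and the caller is expected to reserve a worst-case graph itself.

if (ggml_gallocr_needs_realloc(galloc, graph)) {
    if (galloc->n_buffers == 1) {
        // single-buffer graph: can be re-reserved in place automatically
        if (!ggml_gallocr_reserve(galloc, graph)) {
            return false;
        }
    } else {
        // multi-buffer graph: cannot be reallocated automatically; the
        // caller must call ggml_gallocr_reserve() up front with a
        // worst-case graph (this is the GGML_LOG_DEBUG branch quoted above)
        return false;
    }
}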

@slaren (Member, Author) commented Nov 15, 2025

I was trying to reproduce this, but I hit this assert when running llama-batched-bench (with the current version, without reverting #17143):

llama_kv_cache:      Metal KV buffer size =  5310.00 MiB
llama_kv_cache: size = 5310.00 MiB (151040 cells,  36 layers, 16/1 seqs), K (f16): 2655.00 MiB, V (f16): 2655.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   450.51 MiB
llama_context:        CPU compute buffer size =   299.01 MiB
llama_context: graph nodes  = 1231
llama_context: graph splits = 2
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x141a07b00 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1            0x141a08600 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x141a088c0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32_imrope=0'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32_imrope=0                 0x141a08b80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x141a09340 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x141a09600 | th_max = 1024 | th_width =   32
Assertion failed: (ggml_metal_op_flash_attn_ext_extra_pad(op) == 0), function ggml_metal_op_flash_attn_ext, file ggml-metal-ops.cpp, line 2367.
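
For context, a sketch of the invariant that assertion appears to encode. Assumptions, not confirmed by this thread: ggml_metal_op_flash_attn_ext_extra_pad() reports extra scratch padding the flash-attention op would still need at execution time, its signature takes the op tensor, and the padding must have been accounted for during the graph_reserve sizing passes shown in the log above.

// Hypothetical illustration only, not the actual ggml-metal-ops.cpp code.
static void check_fa_pad(const struct ggml_tensor * op) {
    // a non-zero value here would mean the worst-case reservation made
    // during graph sizing was too small for this flash-attention op
    const size_t extra_pad = ggml_metal_op_flash_attn_ext_extra_pad(op);
    GGML_ASSERT(extra_pad == 0);
}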
