
imatrix : warn when GGUF imatrix is saved without .gguf suffix #15076


Merged (2 commits into master on Aug 4, 2025)

Conversation

compilade (Collaborator)

Follow-up from #14842 (review), because since then imatrix is written as GGUF by default.
This new warning should make it more obvious when a GGUF imatrix is generated but is not necessarily the desired format (see also ikawrakow/ik_llama.cpp#659).

This produces the following warning when --output-format is not specified and the output file doesn't end with .gguf:

save_imatrix: saving imatrix using GGUF format with a different suffix than .gguf
save_imatrix: if you want the previous imatrix format, use --output-format dat
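
For context, the check amounts to a suffix comparison on the output path, gated on whether the user passed --output-format explicitly. Below is a minimal C++ sketch of that logic; the function name and parameters are illustrative, not llama.cpp's actual internals:

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch of the warning logic described above.
// `fname` and `output_format_specified` are illustrative names,
// not the actual variables used in tools/imatrix.
static void warn_on_gguf_suffix_mismatch(const std::string & fname, bool output_format_specified) {
    const std::string suffix = ".gguf";
    const bool has_gguf_suffix =
        fname.size() >= suffix.size() &&
        fname.compare(fname.size() - suffix.size(), suffix.size(), suffix) == 0;

    // Warn only when the format was chosen implicitly (GGUF is the default)
    // and the output path does not look like a GGUF file.
    if (!output_format_specified && !has_gguf_suffix) {
        fprintf(stderr, "save_imatrix: saving imatrix using GGUF format with a different suffix than .gguf\n");
        fprintf(stderr, "save_imatrix: if you want the previous imatrix format, use --output-format dat\n");
    }
}
```

With a call like warn_on_gguf_suffix_mismatch("test-imat.dat", false), this prints the two warning lines shown in the log below; passing --output-format dat (or using a .gguf output name) would suppress it.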
Full output:
$ ./bin/llama-imatrix -m ../../models/gguf/FloatLM-99M-F16.gguf -f ../../models/wikitext-2-raw/calibration_datav3.txt --chunks 25 -o test-imat.dat
build: 6086 (342e7014d) with gcc (GCC) 14.2.1 20250322 for x86_64-unknown-linux-gnu
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from ../../models/gguf/FloatLM-99M-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = FloatLM_99M
llama_model_loader: - kv   3:                         general.size_label str              = 100M
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                          llama.block_count u32              = 16
llama_model_loader: - kv   6:                       llama.context_length u32              = 2048
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 512
llama_model_loader: - kv   8:                  llama.feed_forward_length u32              = 1280
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 8
llama_model_loader: - kv  10:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:         llama.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                          general.file_type u32              = 1
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 50304
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = olmo
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,50304]   = ["<|endoftext|>", "<|padding|>", "!",...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,50304]   = [3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,50009]   = ["Ġ Ġ", "Ġ t", "Ġ a", "h e", "i n...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   33 tensors
llama_model_loader: - type  f16:  114 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 190.31 MiB (16.00 BPW) 
load: special tokens cache size = 25
load: token to piece cache size = 0.2984 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 512
print_info: n_layer          = 16
print_info: n_head           = 8
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1280
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: model type       = 1B
print_info: model params     = 99.76 M
print_info: general.name     = FloatLM_99M
print_info: vocab type       = BPE
print_info: n_vocab          = 50304
print_info: n_merges         = 50009
print_info: BOS token        = 0 '<|endoftext|>'
print_info: EOS token        = 0 '<|endoftext|>'
print_info: EOT token        = 0 '<|endoftext|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: LF token         = 187 'Ċ'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =   190.31 MiB
...................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (512) < n_ctx_train (2048) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_unified:        CPU KV buffer size =    64.00 MiB
llama_kv_cache_unified: size =   64.00 MiB (   512 cells,  16 layers,  4/4 seqs), K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_context:        CPU compute buffer size =   100.25 MiB
llama_context: graph nodes  = 566
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 81.45 ms
compute_imatrix: computing over 25 chunks, n_ctx=512, batch_size=2048, n_seq=4
compute_imatrix: 2.39 seconds per pass - ETA 0.23 minutes
[1]16.7510,[2]20.4547,[3]22.9985,[4]27.5738,[5]25.2990,[6]21.3834,[7]26.8647,[8]27.1727,
save_imatrix: saving imatrix using GGUF format with a different suffix than .gguf
save_imatrix: if you want the previous imatrix format, use --output-format dat
[9]25.6437,[10]21.1753,[11]23.6970,[12]27.4976,[13]27.4642,[14]32.5226,[15]32.2525,[16]32.0059,
save_imatrix: saving imatrix using GGUF format with a different suffix than .gguf
save_imatrix: if you want the previous imatrix format, use --output-format dat
[17]32.7918,[18]33.6250,[19]30.1082,[20]28.4703,[21]27.8973,[22]27.9287,[23]27.4216,[24]27.0536,[25]27.1781,
Final estimate: PPL = 27.1781 +/- 1.82304

save_imatrix: saving imatrix using GGUF format with a different suffix than .gguf
save_imatrix: if you want the previous imatrix format, use --output-format dat

llama_perf_context_print:        load time =    2570.79 ms
llama_perf_context_print: prompt eval time =   14108.60 ms / 12800 tokens (    1.10 ms per token,   907.25 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   14615.49 ms / 12801 tokens
llama_perf_context_print:    graphs reused =          0


@compilade compilade requested a review from CISC August 4, 2025 19:32
@CISC CISC merged commit 19f68fa into master Aug 4, 2025
45 of 47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 5, 2025:
…org#15076)

* imatrix : add warning when suffix is not .gguf for GGUF imatrix

* imatrix : only warn about suffix when output format is unspecified