imatrix : use GGUF to store importance matrices #9400
Conversation
* perplexity : simplify filling the batch
  Sums and counts tensors no longer need to be consecutive.
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
I'm setting this to "draft", because of concerns by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than the legacy `imatrix.dat` format). More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread). I'll need some days to think about how to go further with this.
@compilade This is a good change and I think it would be useful to bring it to completion. In the future, we can extend
The solution in @nicoboss's fork was inspired by ikawrakow/ik_llama.cpp#202 which does mention this concern (and to me seems to agree with the approach taken here):
Really looking forward to this PR being merged into master! In the meantime, you may already know this, but here is a tip shared by @David-AU-github that has worked for me when dealing with imatrices with partial activations in MoEs: increase the model's number of active experts (if KV override is supported), then run the calibration / imatrix computation.
Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
I've also added back the warnings for partial data with the new format, because it can still be useful to know that it is happening, even if the data is not omitted (partial data is handled at read-time instead).

And to make the old format a bit more equivalent in quality to the new format (except when combining multiple imatrix files), the legacy writer now also stores neutral values for missing data.

I've also removed the need to load a model when converting between formats (it was already kind of like this when combining imatrix files), and so the following should be possible:

> **Warning**
> The syntax has changed in #14842

```console
$ ./bin/llama-imatrix --in-file imatrix.dat -o imatrix.gguf
$ ./bin/llama-imatrix --in-file imatrix.gguf -o imatrix-roundtrip.dat
$ ./bin/llama-imatrix --in-file imatrix-roundtrip.dat -o imatrix-roundtrip.gguf
```

Note that shape information for the evaluation counts of MoE tensors is missing from legacy imatrix files, and so it will also be missing from the converted GGUF files. Preserving the shape of the evaluation counts is partly why it's recommended to generate imatrix files in the new format directly. The `.gguf` suffix is forced for the new format; since the old format doesn't have a magic header, the suffix is what selects which format is written.
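As a side note on the suffix-based selection, a minimal sketch of what such a check can look like (the helper names here are hypothetical, not llama.cpp's actual code):

```cpp
// Minimal sketch: because the legacy imatrix format has no magic header,
// the output format can only be chosen from the requested file name.
#include <string>

static bool has_suffix(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Write GGUF when the output path ends in ".gguf", legacy imatrix.dat otherwise.
static bool should_write_gguf(const std::string & out_path) {
    return has_suffix(out_path, ".gguf");
}
```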
@compilade Thanks a lot for your hard work. I'm really looking forward to this PR getting merged! Everything is perfect now in my opinion.
Thanks a lot for listening to our feedback and adding this warning. This should be enough to warn users who accidentally still use the legacy format.
Thanks a lot! This is super useful for judging the quality of the imatrix and the imatrix dataset, as a great imatrix dataset should cover more experts than a bad one (unfortunately there are always cases where even the best imatrix dataset can't cover them all, as training the router to eventually make use of all experts seems quite hard and so was not done well for all MoE models).
That's super cool. I hated how my patch broke intermediate saving, both by the intermediate saves affecting the result and by them hiding how many experts are covered for future saves, which is why we had to disable them. This is such an elegant solution!
I really appreciate and find it super cool how easy conversion between the legacy and the new format is.
With there now being a warning if someone doesn't specify the `.gguf` suffix, this should hopefully no longer happen by accident.
I'm looking forward to that. I appreciate that you're giving everyone time to slowly adopt the new file format. Please just don't forget to eventually drop writing support for the legacy format.
Echoing @nicoboss' sentiment. This is a very nice enhancement, @compilade. Thank you. I've been testing by running different permutations of options, including roundtrips on each test, and comparing the resulting stats. As far as I can tell, everything checks out! The only minor observation is that when converting an existing file to the new format, a
Unavoidable, but a vast improvement from previous behaviour, see #14381. :)
I released a mainline imatrix for Kimi-K2-Instruct using this PR, for anyone looking for one, given it is challenging to compute: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF?show_file_info=mainline%2Fimatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf (interestingly, the imatrix.gguf shows up in the HF model tensor viewer). Feel free to use it for cooking your own custom mainline quants. I also have a version for ik_llama.cpp if that is your thing. Might be cool to look inside it with Ed's imatrix stats tool: #12718. Thanks!
@compilade Time to merge this (and adapt #12718 afterwards)?
Assuming no additional changes on this PR, the enhanced version of #12718 is ready to go as soon as this one is merged.
@CISC Sure. I hope I've tested enough edge cases. Will merge at 16:00 UTC on 2025-07-19 (in around 10 hours), to give some buffer for last-minute problems. (Sorry for the delayed reply; I recently made changes to my home network and now have symmetric fiber Internet.)
```diff
-    e.values.resize(src1->ne[0], 0);
-    e.counts.resize(src1->ne[0], 0);
+    e.values.resize(src1->ne[0] * n_mat, 0);
+    e.counts.resize(n_mat, 0);
```
Just noticed it doesn't really make practical sense to store multiple counts for 3D tensors used with `MUL_MAT`, since they will always all be the same (and redundant). That is, unless the same tensor is also used with `MUL_MAT_ID`, but I don't think there are such architectures yet.

I'm thinking of making it a single count instead, but it will be very easy to make that change backwards-compatible (since the loading code can already deal with flattened counts from converted legacy imatrix files), and so it could be done in a follow-up PR.
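To illustrate the point, a hypothetical sketch (not the actual llama.cpp code): with `MUL_MAT`, every matrix of a 3D tensor processes the same activations, so each per-matrix count receives the same increment, whereas `MUL_MAT_ID` only increments the counts of the experts selected by the router:

```cpp
// Illustrative sketch: why per-matrix counts are redundant for MUL_MAT on a
// 3D tensor, but meaningful for MUL_MAT_ID.
#include <cstdint>
#include <vector>

struct Stats {
    std::vector<float> values; // ne0 * n_mat accumulated sums of squares
    std::vector<float> counts; // one count per matrix
};

// MUL_MAT on a 3D tensor: identical increment for every matrix (redundant).
void count_mul_mat(Stats & e, int64_t n_mat, int64_t n_tokens) {
    for (int64_t m = 0; m < n_mat; ++m) {
        e.counts[m] += (float) n_tokens;
    }
}

// MUL_MAT_ID: only the experts chosen by the router are incremented,
// so the counts can genuinely differ per matrix.
void count_mul_mat_id(Stats & e, const std::vector<int32_t> & selected_experts) {
    for (const int32_t id : selected_experts) {
        e.counts[id] += 1.0f;
    }
}
```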
…n imatrix file (#12718)

* Add --show-statistics option
* Add --show-statistics logic
* Add tensor name parsing
* Tidy output format
* Fix typo in title
* Improve tensor influence ranking
* Add better statistics
* Change statistics' sort order
* Add Cosine Similarity
* Add header search path
* Change header search path to private
* Add weighted statistics per layer
* Update report title
* Refactor compute_statistics out of main
* Refactor compute_cossim out of load_imatrix
* Refactor compute_statistics out of load_imatrix
* Move imatrix statistics calculation into its own functions
* Add checks and validations
* Remove unnecessary include directory
* Rename labels
* Add m_stats getter and refactor compute_statistics out of load_imatrix
* Refactor variable names
* Minor cosmetic change
* Retrigger checks (empty commit)
* Rerun checks (empty commit)
* Fix unnecessary type promotion (Co-authored-by: compilade <git@compilade.net>)
* Reverting change to improve code readability
* Rerun checks (empty commit)
* Rerun checks (empty commit)
* Rerun checks - third time's the Charm 🤞 (empty commit)
* Minor cosmetic change
* Update README
* Fix typo
* Update README
* Rerun checks (empty commit)
* Re-implement changes on top of #9400
* Update README.md
* Update README
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md
* Remove duplicate option in print_usage()
* Update README.md
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Remove input check
* Remove commented out code

Co-authored-by: compilade <git@compilade.net>
Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).
Using GGUF as the format for `imatrix` files will be useful for further experiments (e.g. with L²QER) and for compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of `gguf-diff`, etc.).

There are multiple problems with `imatrix` which this is addressing:

* Saving the data depends on `unordered_map` iteration order, which makes `sha256sum` useless to compare `imatrix` files made on the same dataset (see the sketch after this list)
* The frequency of intermediate saves depends on `-ub` (intermediate saves happen waaay too often)
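As a rough illustration of the first problem (a hypothetical demo, not llama.cpp code):

```cpp
// std::unordered_map does not guarantee any particular iteration order, so
// serializing entries in iteration order can produce byte-different files
// for the same data (depending on the standard library implementation and
// on insertion history), which defeats sha256sum comparisons.
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> stats = {
        {"blk.0.attn_q.weight", 1},
        {"blk.0.ffn_up.weight", 2},
        {"blk.1.attn_q.weight", 3},
    };
    // The order these lines print in is unspecified; two builds (or two code
    // paths that insert in a different order) may produce different output.
    for (const auto & kv : stats) {
        std::printf("%s\n", kv.first.c_str());
    }
    return 0;
}
```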
Summary of changes

* Use GGUF to store `imatrix` data.
  * `general.type` is `imatrix`
  * `general.architecture` is excluded from `imatrix` files.
  * `*.in_sum2` and `*.counts` for each tensor with imatrix data.
    * `*.in_sum2` are the per-channel sums of squared activations, stored in `F32`, like before.
    * `*.counts` are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is used by `llama-quantize`) and to weight the data when combining multiple `imatrix` files together with `--in-file` (see the sketch after this list for how the two relate).
    * They are stored in `F32` even though they are integer values, because when calculating the mean they would be converted to `F32` anyway to perform the division.
* Add `convert_legacy_imatrix_to_gguf.py` to convert old `imatrix.dat` files to `imatrix.gguf`
  * The old format can still be read by `llama-quantize` (with a warning), or it can be converted with `llama-imatrix` directly (when the output file has the `.gguf` suffix).
* Like `llama-perplexity` since #5946 (perplexity : support using multiple sequences to allow larger batch sizes), allow computing multiple chunks per batch with `llama-imatrix`
* Use fused multiply-add (with `std::fma`) when accumulating the sums of activations
  * (results in slightly different values than with `llama-imatrix` from `master`)
  * (Using `f64` would be even better, but I'm not sure it's worth it yet. For the curious, using `double` for the intermediate accumulations can be tried by changing only one line in `IMatrixStats`: `vector<float> values` to `vector<double> values`.)
* Sort the tensor names before saving instead of relying on `unordered_map` iteration order, so that `sha256sum` can be meaningfully used to compare `imatrix` files generated in very similar conditions.
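To make the relationship concrete, here is a minimal sketch (assumed struct and function names, not the actual llama.cpp structures) of how `*.in_sum2` and `*.counts` are accumulated with `std::fma` and how a quantization tool can derive the mean squared activations from them:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct TensorStats {
    std::vector<float> in_sum2;       // per-channel sum of x*x (stored as F32)
    float              counts = 0.0f; // activation count (integer-valued, stored as F32)
};

// Accumulate one activation row of width in_sum2.size().
void accumulate(TensorStats & s, const std::vector<float> & x) {
    for (size_t j = 0; j < x.size(); ++j) {
        // fused multiply-add: x[j]*x[j] + in_sum2[j] with a single rounding
        s.in_sum2[j] = std::fma(x[j], x[j], s.in_sum2[j]);
    }
    s.counts += 1.0f;
}

// Per-channel mean squared activation, as a quantization tool would use it.
float mean_sq(const TensorStats & s, size_t j) {
    return s.in_sum2[j] / s.counts;
}
```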
TODO

* Compare quantizations made with the old `llama-quantize` and an old `imatrix.dat` against the new `llama-quantize` using a converted `imatrix.gguf`
  * The resulting files should match when compared with `sha256sum`.
* Test `llama-imatrix` at different batch sizes
  * e.g. `-ub 64 -b 512` and `-ub 512 -b 2048` for a chunk size of 512 (`-c 512`)
  * old `llama-imatrix` vs new `llama-imatrix`
* Test `--in-file` with `llama-imatrix`
  * merging `.imatrix` or `.gguf` imatrix files (for round-trip conversions)
* Test `GGUFWriter` with `general.architecture` exclusion.
  * Made `self.add_architecture()` a no-op, but maybe `general.architecture` should simply be excluded when `self.arch == ""`. Not sure how to prevent using the other `self.add_*` methods (in `GGUFWriter`) which expect `self.arch` to be something.
* What should be done about old `imatrix.dat` files?