imatrix : use GGUF to store importance matrices #9400
Conversation
* perplexity : simplify filling the batch
  Sums and counts tensors no longer need to be consecutive.
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
I'm setting this to "draft", because of concerns by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than the legacy `imatrix.dat` format). More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread). I'll need some days to think about how to go further with this.
@compilade This is a good change and I think it would be useful to bring it to completion. In the future, we can extend
The solution in @nicoboss's fork was inspired by ikawrakow/ik_llama.cpp#202 which does mention this concern (and to me seems to agree with the approach taken here):
Really looking forward to this PR being merged into master! In the meantime, you may already know this, but here is a tip shared by @David-AU-github that has worked for me when dealing with imatrices with partial activations in MoEs: increase the model's number of active experts (if KV override is supported), then run the calibration / imatrix computation.
Also make the legacy format store partial data by using neutral values for missing data. This matches what is done at read-time for the new format, and so should get the same quality in case the old format is still used.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
I've also added back the warnings for partial data with the new format, because it can still be useful to know that it is happening, even if the data is not omitted (partial data is handled at read-time instead).

And to make the old format a bit more equivalent in quality to the new format (except when combining multiple imatrix files), the legacy writer now also stores neutral values for missing data.

I've also removed the need to load a model when converting between formats (it was already kind of like this when combining imatrix files), and so the following should be possible:

> **Warning**
> The syntax has changed in #14842

```console
$ ./bin/llama-imatrix --in-file imatrix.dat -o imatrix.gguf
$ ./bin/llama-imatrix --in-file imatrix.gguf -o imatrix-roundtrip.dat
$ ./bin/llama-imatrix --in-file imatrix-roundtrip.dat -o imatrix-roundtrip.gguf
```

Note that shape information for the evaluation counts of MoE tensors is missing from legacy imatrix files, and so it will also be missing from the converted GGUF files. Preserving the shape of the evaluation counts is partly why it's recommended to generate imatrix files in the new format directly. The `.gguf` suffix is forced for the new format; since the old format doesn't have a magic header, the suffix is what selects which format is written.
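As a side note on the suffix-based selection, a minimal sketch of what such a check can look like (the helper names here are hypothetical, not llama.cpp's actual code):

```cpp
// Minimal sketch: because the legacy imatrix format has no magic header,
// the output format can only be chosen from the requested file name.
#include <string>

static bool has_suffix(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Write GGUF when the output path ends in ".gguf", legacy imatrix.dat otherwise.
static bool should_write_gguf(const std::string & out_path) {
    return has_suffix(out_path, ".gguf");
}
```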
@compilade Thanks a lot for your hard work. I'm really looking forward to this PR getting merged! Everything is perfect now in my opinion.
Thanks a lot for listening to our feedback and adding this warning. This should be enough to warn users who accidentally still use the legacy format.
Thanks a lot! This is super useful for judging the quality of the imatrix and the imatrix dataset, as a great imatrix dataset should cover more experts than a bad one (unfortunately there are always cases where even the best imatrix dataset can't cover them all, as training the router to eventually make use of all experts seems quite hard and so was not done well for all MoE models).
That's super cool. I hated how my patch broke intermediate saving, both by the intermediate saves affecting the result and by them hiding how many experts are covered for future saves, which is why we had to disable them. This is such an elegant solution!
I really appreciate and find it super cool how easy conversion between the legacy and the new format is.
With there now being a warning if someone doesn't specify the `.gguf` suffix, this should hopefully no longer happen by accident.
I'm looking forward to that. I appreciate that you're giving everyone time to slowly adopt the new file format. Please just don't forget to eventually drop writing support for the legacy format.
Echoing @nicoboss' sentiment. This is a very nice enhancement, @compilade. Thank you. I've been testing by running different permutations of options, including roundtrips on each test, and comparing the resulting stats. As far as I can tell, everything checks out! The only minor observation is that when converting an existing file to the new format, a
Unavoidable, but a vast improvement from previous behaviour, see #14381. :)
I released a mainline imatrix for Kimi-K2-Instruct using this PR, for anyone looking for one, given it is challenging to compute: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF?show_file_info=mainline%2Fimatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf (interestingly, the imatrix.gguf shows up in the HF model tensor viewer). Feel free to use it for cooking your own custom mainline quants. I also have a version for ik_llama.cpp if that is your thing. Might be cool to look inside it with Ed's imatrix stats tool: #12718. Thanks!
@compilade Time to merge this (and adapt #12718 afterwards)?
Assuming no additional changes on this PR, the enhanced version of #12718 is ready to go as soon as this one is merged.
@CISC Sure. I hope I've tested enough edge cases. Will merge at 16:00 UTC on 2025-07-19 (in around 10 hours), to give some buffer for last-minute problems. (Sorry for the delayed reply; I recently made changes to my home network and now have symmetric fiber Internet.)
```diff
-    e.values.resize(src1->ne[0], 0);
-    e.counts.resize(src1->ne[0], 0);
+    e.values.resize(src1->ne[0] * n_mat, 0);
+    e.counts.resize(n_mat, 0);
```
Just noticed it doesn't really make practical sense to store multiple counts for 3D tensors used with `MUL_MAT`, since they will always all be the same (and redundant). That is, unless the same tensor is also used with `MUL_MAT_ID`, but I don't think there are such architectures yet.

I'm thinking of making it a single count instead, but it will be very easy to make that change backwards-compatible (since the loading code can already deal with flattened counts from converted legacy imatrix files), and so it could be done in a follow-up PR.
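To illustrate the point, a hypothetical sketch (not the actual llama.cpp code): with `MUL_MAT`, every matrix of a 3D tensor processes the same activations, so each per-matrix count receives the same increment, whereas `MUL_MAT_ID` only increments the counts of the experts selected by the router:

```cpp
// Illustrative sketch: why per-matrix counts are redundant for MUL_MAT on a
// 3D tensor, but meaningful for MUL_MAT_ID.
#include <cstdint>
#include <vector>

struct Stats {
    std::vector<float> values; // ne0 * n_mat accumulated sums of squares
    std::vector<float> counts; // one count per matrix
};

// MUL_MAT on a 3D tensor: identical increment for every matrix (redundant).
void count_mul_mat(Stats & e, int64_t n_mat, int64_t n_tokens) {
    for (int64_t m = 0; m < n_mat; ++m) {
        e.counts[m] += (float) n_tokens;
    }
}

// MUL_MAT_ID: only the experts chosen by the router are incremented,
// so the counts can genuinely differ per matrix.
void count_mul_mat_id(Stats & e, const std::vector<int32_t> & selected_experts) {
    for (const int32_t id : selected_experts) {
        e.counts[id] += 1.0f;
    }
}
```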
…n imatrix file (#12718)

* Add --show-statistics option
* Add --show-statistics logic
* Add tensor name parsing
* Tidy output format
* Fix typo in title
* Improve tensor influence ranking
* Add better statistics
* Change statistics' sort order
* Add Cosine Similarity
* Add header search path
* Change header search path to private
* Add weighted statistics per layer
* Update report title
* Refactor compute_statistics out of main
* Refactor compute_cossim out of load_imatrix
* Refactor compute_statistics out of load_imatrix
* Move imatrix statistics calculation into its own functions
* Add checks and validations
* Remove unnecessary include directory
* Rename labels
* Add m_stats getter and refactor compute_statistics out of load_imatrix
* Refactor variable names
* Minor cosmetic change
* Retrigger checks (empty commit)
* Rerun checks (empty commit)
* Fix unnecessary type promotion (Co-authored-by: compilade <git@compilade.net>)
* Reverting change to improve code readability
* Rerun checks (empty commit)
* Rerun checks (empty commit)
* Rerun checks - third time's the Charm 🤞 (empty commit)
* Minor cosmetic change
* Update README
* Fix typo
* Update README
* Rerun checks (empty commit)
* Re-implement changes on top of #9400
* Update README.md
* Update README
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md
* Remove duplicate option in print_usage()
* Update README.md
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Update README.md (Co-authored-by: compilade <git@compilade.net>)
* Remove input check
* Remove commented out code

Co-authored-by: compilade <git@compilade.net>
Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).
Using GGUF as the format for `imatrix` files will be useful for further experiments (e.g. with L²QER) and for compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of `gguf-diff`, etc.).

There are multiple problems with `imatrix` which this is addressing:

* Saving the data depends on `unordered_map` iteration order, which makes `sha256sum` useless to compare `imatrix` files made on the same dataset (see the sketch after this list)
* The frequency of intermediate saves depends on `-ub` (intermediate saves happen waaay too often)
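As a rough illustration of the first problem (a hypothetical demo, not llama.cpp code):

```cpp
// std::unordered_map does not guarantee any particular iteration order, so
// serializing entries in iteration order can produce byte-different files
// for the same data (depending on the standard library implementation and
// on insertion history), which defeats sha256sum comparisons.
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> stats = {
        {"blk.0.attn_q.weight", 1},
        {"blk.0.ffn_up.weight", 2},
        {"blk.1.attn_q.weight", 3},
    };
    // The order these lines print in is unspecified; two builds (or two code
    // paths that insert in a different order) may produce different output.
    for (const auto & kv : stats) {
        std::printf("%s\n", kv.first.c_str());
    }
    return 0;
}
```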
Summary of changes

* Use GGUF to store `imatrix` data.
  * `general.type` is `imatrix`
  * `general.architecture` is excluded from `imatrix` files.
  * `*.in_sum2` and `*.counts` for each tensor with imatrix data.
    * `*.in_sum2` are the per-channel sums of squared activations, stored in `F32`, like before.
    * `*.counts` are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is used by `llama-quantize`) and to weight the data when combining multiple `imatrix` files together with `--in-file` (see the sketch after this list for how the two relate).
    * They are stored in `F32` even though they are integer values, because when calculating the mean they would be converted to `F32` anyway to perform the division.
* Add `convert_legacy_imatrix_to_gguf.py` to convert old `imatrix.dat` files to `imatrix.gguf`
  * The old format can still be read by `llama-quantize` (with a warning), or it can be converted with `llama-imatrix` directly (when the output file has the `.gguf` suffix).
* Like `llama-perplexity` since #5946 (perplexity : support using multiple sequences to allow larger batch sizes), allow computing multiple chunks per batch with `llama-imatrix`
* Use fused multiply-add (with `std::fma`) when accumulating the sums of activations
  * (results in slightly different values than with `llama-imatrix` from `master`)
  * (Using `f64` would be even better, but I'm not sure it's worth it yet. For the curious, using `double` for the intermediate accumulations can be tried by changing only one line in `IMatrixStats`: `vector<float> values` to `vector<double> values`.)
* Sort the tensor names before saving instead of relying on `unordered_map` iteration order, so that `sha256sum` can be meaningfully used to compare `imatrix` files generated in very similar conditions.
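To make the relationship concrete, here is a minimal sketch (assumed struct and function names, not the actual llama.cpp structures) of how `*.in_sum2` and `*.counts` are accumulated with `std::fma` and how a quantization tool can derive the mean squared activations from them:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct TensorStats {
    std::vector<float> in_sum2;       // per-channel sum of x*x (stored as F32)
    float              counts = 0.0f; // activation count (integer-valued, stored as F32)
};

// Accumulate one activation row of width in_sum2.size().
void accumulate(TensorStats & s, const std::vector<float> & x) {
    for (size_t j = 0; j < x.size(); ++j) {
        // fused multiply-add: x[j]*x[j] + in_sum2[j] with a single rounding
        s.in_sum2[j] = std::fma(x[j], x[j], s.in_sum2[j]);
    }
    s.counts += 1.0f;
}

// Per-channel mean squared activation, as a quantization tool would use it.
float mean_sq(const TensorStats & s, size_t j) {
    return s.in_sum2[j] / s.counts;
}
```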
TODO

* Compare quantizations made with the old `llama-quantize` and an old `imatrix.dat` against the new `llama-quantize` using a converted `imatrix.gguf`
  * The resulting files should match when compared with `sha256sum`.
* Test `llama-imatrix` at different batch sizes
  * e.g. `-ub 64 -b 512` and `-ub 512 -b 2048` for a chunk size of 512 (`-c 512`)
  * old `llama-imatrix` vs new `llama-imatrix`
* Test `--in-file` with `llama-imatrix`
  * merging `.imatrix` or `.gguf` imatrix files (for round-trip conversions)
* Test `GGUFWriter` with `general.architecture` exclusion.
  * Made `self.add_architecture()` a no-op, but maybe `general.architecture` should simply be excluded when `self.arch == ""`. Not sure how to prevent using the other `self.add_*` methods (in `GGUFWriter`) which expect `self.arch` to be something.
* What should be done about old `imatrix.dat` files?