convert : handle pre-quantized models #14810

compilade · 2025-07-22T08:31:45Z

Should fix #14762, and also address #3353.

This roughly implements the idea in #14762 (comment) to allow converting from pre-quantized models, by splitting ModelBase.get_tensors into an intermediate ModelBase.index_tensors which returns a dict[str, Callable[[], Tensor]], which can be modified before get_tensors is called. get_tensors still keeps the same signature (it still returns an Iterator[tuple[str, Tensor]]).

For now, support for these pre-quantizations has been implemented:

bitnet
- https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens
  - works
- https://huggingface.co/codys12/Qwen3-8B-BitNet
  - not really supported (yet) because of the extra RMS norms; will be possible in a separate PR
- https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
  - should be possible to support once the architecture is also added
fp8
- https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B-FP8
  - has relatively high perplexity, not sure why
- https://huggingface.co/Qwen/Qwen3-4B-FP8
  - has good relatively low perplexity, so I assume this works
gptq (only 2, 4 and 8 bits, not 3)
- https://huggingface.co/AlignQuant/Meta-Llama-3-8B-Instruct-GPTQ-2bit
  - high perplexity (in the hundreads), but I guess that's normal for 2-bit GPTQ? (maybe not...)
- https://huggingface.co/eewer/Qwen3-0.6B-4bit-GPTQ
  - works
- https://huggingface.co/reinattwijaya/Qwen3-0.6B-final-noreason-gptq-8bit-c4
  - works

The 3-bit variant of GPTQ is more complicated, and so was omitted for now.

Notes

This removes ModelBase.tensor_names in favor of self.model_tensors which also allows getting the tensor data (because it's a dict[str, Callable[[], Tensor]].
CodeShell had a workaround in llama : add support for GPT2, Bloom and CodeShell tied word embeddings #12456 (by @CISC) which removed the lm_head.weight tensor from self.tensor_names, but I don't see why it's necessary. I've removed it because self.tensor_names was also removed.
- I've converted https://huggingface.co/WisdomShell/CodeShell-7B-Chat without problem
- I've also ran a --dry-run of a --remote conversion of https://huggingface.co/WisdomShell/CodeShell-7B, which does run the tensor index completeness check. No problem either.
Lambda functions in Python don't quite capture their environment, so I've used the default-parameter trick from https://stackoverflow.com/a/2295372 a lot.

TODO

Test if this causes memory usage regressions
- Lazy or not, safetensors or not
- So far it seems good.
Test remote conversion (with --remote)

Make sure to read the contributing guidelines before submitting a PR

ggerganov

Perfect!

In case you feel like it, add support for MXFP4 as well. I will be upstreaming a ggml implementation soon and it would be nice to have HF conversion support. You can use some of the smaller models here https://huggingface.co/models?sort=created&search=mxfp4 (anyone without hadamard matrices should work).

CISC · 2025-07-22T10:15:56Z

In case you were wondering, the workaround was about this line:
https://huggingface.co/WisdomShell/CodeShell-7B-Chat/blob/main/pytorch_model.bin.index.json#L555

Conversion would fail because that tensor doesn't exist.

Edit: I might have used the safetensors version, could be it actually works with pytorch bins.

convert : begin handling pre-quantized models

de12f8a

compilade added enhancement New feature or request python python script changes labels Jul 22, 2025

compilade mentioned this pull request Jul 22, 2025

Feature Request: Direct FP8 conversion from convert_hf_to_gguf.py #14762

Closed

4 tasks

ggerganov approved these changes Jul 22, 2025

View reviewed changes

CISC approved these changes Jul 22, 2025

View reviewed changes

CISC mentioned this pull request Jul 29, 2025

model: Add support for GLM 4.5 family of models (#14921) #14939

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

convert : handle pre-quantized models #14810

convert : handle pre-quantized models #14810

compilade commented Jul 22, 2025

Uh oh!

ggerganov left a comment

Uh oh!

CISC commented Jul 22, 2025 •

edited

Loading

Uh oh!

Uh oh!

convert : handle pre-quantized models #14810

Are you sure you want to change the base?

convert : handle pre-quantized models #14810

Conversation

compilade commented Jul 22, 2025

Notes

TODO

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

CISC commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

CISC commented Jul 22, 2025 •

edited

Loading