
Commit 7f97599

EAddario and CISC authored
quantize : update README.md (#14905)
* Update README.md
* Fix trailing whitespace
* Update README.md
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
1 parent bf78f54 commit 7f97599

File tree: 1 file changed (+113 / -71 lines)

tools/quantize/README.md

Lines changed: 113 additions & 71 deletions
@@ -1,18 +1,25 @@
 # quantize
 
+This tool takes a GGUF input model file, typically in a high-precision format like F32 or BF16, and converts it to a quantized format.
+Quantization reduces the precision of model weights (e.g., from 32-bit floats to 4-bit integers), which shrinks the model's size and can speed up inference.
+This process, however, may introduce some accuracy loss, which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
+This can be minimized by using a suitable imatrix file.
+
 You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
 
 Note: It is synced from llama.cpp `main` every 6 hours.
 
 Example usage:
 
+```./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]```
+
 ```bash
-# obtain the official LLaMA model weights and place them in ./models
-ls ./models
-llama-2-7b tokenizer_checklist.chk tokenizer.model
-# [Optional] for models using BPE tokenizers
+# from Hugging Face, obtain the official meta-llama/Llama-3.1-8B model weights and place them in ./models
 ls ./models
-<folder containing weights and tokenizer json> vocab.json
+config.json model-00001-of-00004.safetensors model-00004-of-00004.safetensors README.md tokenizer.json
+generation_config.json model-00002-of-00004.safetensors model.safetensors.index.json special_tokens_map.json USE_POLICY.md
+LICENSE model-00003-of-00004.safetensors original tokenizer_config.json
+
 # [Optional] for PyTorch .bin models like Mistral-7B
 ls ./models
 <folder containing weights and tokenizer json>
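
The new introduction above recommends quantizing with a suitable imatrix file. A minimal sketch of producing one with `llama-imatrix` before quantizing (the calibration file name is a placeholder for any representative plain-text corpus; check `llama-imatrix --help` for the exact flags in your build):

```bash
# sketch: build an importance matrix from a calibration text file, then quantize with it
./llama-imatrix -m ./models/mymodel/ggml-model-f16.gguf -f calibration.txt -o imatrix.gguf
./llama-quantize --imatrix imatrix.gguf ./models/mymodel/ggml-model-f16.gguf q4_k_m 8
```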
@@ -21,7 +28,7 @@ ls ./models
 python3 -m pip install -r requirements.txt
 
 # convert the model to ggml FP16 format
-python3 convert_hf_to_gguf.py models/mymodel/
+python3 convert_hf_to_gguf.py ./models/mymodel/
 
 # quantize the model to 4-bits (using Q4_K_M method)
 ./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
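
The conversion step above targets F16; since the new introduction mentions F32 or BF16 inputs, a hedged variant using the `--outtype` and `--outfile` options of `convert_hf_to_gguf.py` (availability depends on your checkout) might look like:

```bash
# sketch: convert at BF16 instead of F16, then quantize from that file
python3 convert_hf_to_gguf.py ./models/mymodel/ --outtype bf16 --outfile ./models/mymodel/ggml-model-bf16.gguf
./llama-quantize ./models/mymodel/ggml-model-bf16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
```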
@@ -37,40 +44,117 @@ Run the quantized model:
 ./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
 ```
 
-When running the larger models, make sure you have enough disk space to store all the intermediate files.
+Options:
+* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16-bit or 32-bit
+* `--leave-output-tensor` will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
+* `--imatrix` uses the data in the file generated by `llama-imatrix` as an importance matrix for quant optimizations (highly recommended)
+* `--include-weights` use the importance matrix for the tensor(s) in the list. Cannot be used with `--exclude-weights`
+* `--exclude-weights` do not use the importance matrix for the tensor(s) in the list. Cannot be used with `--include-weights`
+* `--output-tensor-type` use a specific quant type for the output.weight tensor
+* `--token-embedding-type` use a specific quant type for the token embeddings tensor
+* `--keep-split` will generate the quantized model in the same shards as the input file; otherwise it will produce a single quantized file
+
+Advanced options:
+* `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times.
+* `--prune-layers` prune (remove) the layers in the list
+* `--override-kv` override model metadata by key in the quantized model. May be specified multiple times
+
+Examples:
+
+```bash
+# naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"
+./llama-quantize input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# quantize model enabling re-quantization, leaving the output tensor unquantized and all others quantized at the same level (Q4_K)
+./llama-quantize --allow-requantize --leave-output-tensor --pure input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# quantize model using an importance matrix for specified tensors only (attn_v and ffn_down)
+./llama-quantize --imatrix imatrix.gguf --include-weights attn_v --include-weights ffn_down input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# quantize model setting the output tensor to Q5_K, token embeddings to Q3_K, and keeping the input file's shards
+./llama-quantize --imatrix imatrix.gguf --output-tensor-type q5_k --token-embedding-type q3_k --keep-split input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# quantize model using a regex to quantize attn_k tensors in odd layers to Q5_K and attn_q tensors in even layers to Q3_K
+./llama-quantize --imatrix imatrix.gguf --tensor-type "\.(\d*[13579])\.attn_k=q5_k" --tensor-type "\.(\d*[02468])\.attn_q=q3_k" input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# quantize model setting tensors attn_v and ffn_down to Q5_K and pruning layers 20, 21, and 22
+./llama-quantize --imatrix imatrix.gguf --tensor-type attn_v=q5_k --tensor-type ffn_down=q5_k --prune-layers 20,21,22 input-model-f32.gguf q4_k_m 8
+```
+
+```bash
+# override the expert-used-count metadata to 16, prune layers 20, 21, and 22 without quantizing the model (copy tensors), and use the specified name for the output file
+./llama-quantize --imatrix imatrix.gguf --override-kv qwen3moe.expert_used_count=int:16 --prune-layers 20,21,22 input-model-f32.gguf pruned-model-f32.gguf copy 8
+```
 
 ## Memory/Disk Requirements
 
-As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
+When running the larger models, make sure you have enough disk space to store all the intermediate files.
+As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same. For example (Llama 3.1):
+
+| Model | Original size | Quantized size (Q4_K_M) |
+| ----: | ------------: | ----------------------: |
+| 8B | 32.1 GB | 4.9 GB |
+| 70B | 280.9 GB | 43.1 GB |
+| 405B | 1,625.1 GB | 249.1 GB |
 
-| Model | Original size | Quantized size (Q4_0) |
-|------:|--------------:|----------------------:|
-| 7B | 13 GB | 3.9 GB |
-| 13B | 24 GB | 7.8 GB |
-| 30B | 60 GB | 19.5 GB |
-| 65B | 120 GB | 38.5 GB |
 
 ## Quantization
 
-Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
+Several quantization methods are supported. They differ in the resulting model disk size and inference speed. For example:
+
+### [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
+
+| Measure | IQ1_S | IQ1_M | IQ2_XXS | IQ2_XS | IQ2_S | IQ2_M |
+| --------------------------- | ------------ | ------------ | ------------ | ------------- | ------------- | ------------ |
+| bits/weight | 2.0042 | 2.1460 | 2.3824 | 2.5882 | 2.7403 | 2.9294 |
+| size (GiB) | 1.87 | 2.01 | 2.23 | 2.42 | 2.56 | 2.74 |
+| prompt processing t/s @ 512 | 858.88 ±1.22 | 847.99 ±0.47 | 852.39 ±0.85 | 826.99 ±12.51 | 783.55 ±13.73 | 787.68 ±7.00 |
+| text generation t/s @ 128 | 79.73 ±0.79 | 72.92 ±0.14 | 79.86 ±0.22 | 78.04 ±0.46 | 77.30 ±2.47 | 74.44 ±0.15 |
+
+| Measure | IQ3_XXS | IQ3_XS | IQ3_S | IQ3_M | IQ4_XS | IQ4_NL |
+| --------------------------- | ------------ | ------------ | ------------ | ------------- | ------------- | ------------ |
+| bits/weight | 3.2548 | 3.4977 | 3.6606 | 3.7628 | 4.4597 | 4.6818 |
+| size (GiB) | 3.04 | 3.27 | 3.42 | 3.52 | 4.17 | 4.38 |
+| prompt processing t/s @ 512 | 813.88 ±6.53 | 708.71 ±1.26 | 798.78 ±8.81 | 768.70 ±13.73 | 771.80 ±11.38 | 806.03 ±7.07 |
+| text generation t/s @ 128 | 73.95 ±0.20 | 71.67 ±0.54 | 69.31 ±0.63 | 70.15 ±0.33 | 77.51 ±0.20 | 76.63 ±0.28 |
 
-*(outdated)*
 
-| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
-|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
-| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
-| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
-| 7B | ms/tok @ 4th | 127 | 55 | 54 | 76 | 83 | 72 |
-| 7B | ms/tok @ 8th | 122 | 43 | 45 | 52 | 56 | 67 |
-| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
-| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
-| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
-| 13B | ms/tok @ 4th | - | 103 | 105 | 148 | 160 | 131 |
-| 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
-| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
+| Measure | Q2_K_S | Q2_K | Q3_K_S | Q3_K_M | Q3_K_L | Q4_K_S |
+| --------------------------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
+| bits/weight | 2.9697 | 3.1593 | 3.6429 | 3.9960 | 4.2979 | 4.6672 |
+| size (GiB) | 2.78 | 2.95 | 3.41 | 3.74 | 4.02 | 4.36 |
+| prompt processing t/s @ 512 | 798.91 ±6.40 | 784.45 ±7.85 | 752.17 ±7.94 | 783.44 ±9.92 | 761.17 ±7.55 | 818.55 ±9.58 |
+| text generation t/s @ 128 | 90.01 ±0.12 | 79.85 ±0.20 | 69.84 ±0.18 | 71.68 ±0.22 | 69.38 ±0.49 | 76.71 ±0.20 |
+
+| Measure | Q4_K_S | Q4_K_M | Q5_K_S | Q5_K_M | Q6_K | Q8_0 |
+| --------------------------- | ------------ | ------------- | ------------ | ------------ | ------------- | ------------ |
+| bits/weight | 4.6672 | 4.8944 | 5.5704 | 5.7036 | 6.5633 | 8.5008 |
+| size (GiB) | 4.36 | 4.58 | 5.21 | 5.33 | 6.14 | 7.95 |
+| prompt processing t/s @ 512 | 818.55 ±9.58 | 821.81 ±21.44 | 752.52 ±0.99 | 758.69 ±7.43 | 812.01 ±10.82 | 865.09 ±8.30 |
+| text generation t/s @ 128 | 76.71 ±0.20 | 71.93 ±1.52 | 69.53 ±0.18 | 67.23 ±1.08 | 58.67 ±3.13 | 50.93 ±0.08 |
+
+| Measure | F16 |
+| --------------------------- | ------------ |
+| bits/weight | 16.0005 |
+| size (GiB) | 14.96 |
+| prompt processing t/s @ 512 | 923.49 ±0.53 |
+| text generation t/s @ 128 | 29.17 ±0.04 |
+
+## Background information on llama-quantize
 
 - [k-quants](https://github.com/ggml-org/llama.cpp/pull/1684)
-- recent k-quants improvements and new i-quants
+- k-quants improvements and i-quants
 - [#2707](https://github.com/ggml-org/llama.cpp/pull/2707)
 - [#2807](https://github.com/ggml-org/llama.cpp/pull/2807)
 - [#4773 - 2-bit i-quants (inference)](https://github.com/ggml-org/llama.cpp/pull/4773)
@@ -85,45 +169,3 @@ Several quantization methods are supported. They differ in the resulting model d
 - [#5060 - Q3_K_XS](https://github.com/ggml-org/llama.cpp/pull/5060)
 - [#5196 - 3-bit i-quants](https://github.com/ggml-org/llama.cpp/pull/5196)
 - [quantization tuning](https://github.com/ggml-org/llama.cpp/pull/5320), [another one](https://github.com/ggml-org/llama.cpp/pull/5334), and [another one](https://github.com/ggml-org/llama.cpp/pull/5361)
-
-**Llama 2 7B**
-
-| Quantization | Bits per Weight (BPW) |
-|--------------|-----------------------|
-| Q2_K | 3.35 |
-| Q3_K_S | 3.50 |
-| Q3_K_M | 3.91 |
-| Q3_K_L | 4.27 |
-| Q4_K_S | 4.58 |
-| Q4_K_M | 4.84 |
-| Q5_K_S | 5.52 |
-| Q5_K_M | 5.68 |
-| Q6_K | 6.56 |
-
-**Llama 2 13B**
-
-Quantization | Bits per Weight (BPW)
--- | --
-Q2_K | 3.34
-Q3_K_S | 3.48
-Q3_K_M | 3.89
-Q3_K_L | 4.26
-Q4_K_S | 4.56
-Q4_K_M | 4.83
-Q5_K_S | 5.51
-Q5_K_M | 5.67
-Q6_K | 6.56
-
-**Llama 2 70B**
-
-Quantization | Bits per Weight (BPW)
--- | --
-Q2_K | 3.40
-Q3_K_S | 3.47
-Q3_K_M | 3.85
-Q3_K_L | 4.19
-Q4_K_S | 4.53
-Q4_K_M | 4.80
-Q5_K_S | 5.50
-Q5_K_M | 5.65
-Q6_K | 6.56
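
As a quick consistency check on the Llama-3.1-8B tables above, file size follows from bits/weight: multiply by the parameter count (assumed here to be roughly 8.03 billion), then divide by 8 bits per byte and by 2^30 bytes per GiB:

```bash
# sketch: size (GiB) ≈ n_params * bits_per_weight / 8 / 2^30, assuming ~8.03e9 parameters
python3 -c "print(round(8.03e9 * 4.8944 / 8 / 2**30, 2))"   # Q4_K_M -> ~4.58 GiB
python3 -c "print(round(8.03e9 * 16.0005 / 8 / 2**30, 2))"  # F16    -> ~14.96 GiB
```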
