This tool takes a GGUF input model file, typically in a high-precision format like F32 or BF16, and converts it to a quantized format.
Quantization reduces the precision of model weights (e.g., from 32-bit floats to 4-bit integers), which shrinks the model's size and can speed up inference.
This process, however, may introduce some accuracy loss, which is usually measured in [Perplexity](https://huggingface.co/docs/transformers/en/perplexity) (ppl) and/or [Kullback–Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) (kld).
This loss can be minimized by using a suitable imatrix file.
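A common workflow, sketched below with placeholder file names, is to first generate an importance matrix with `llama-imatrix` from some calibration text and then pass it to `llama-quantize` via `--imatrix`:

```bash
# generate an importance matrix from a calibration text file (file names are illustrative)
./llama-imatrix -m input-model-f32.gguf -f calibration-data.txt -o imatrix.dat

# quantize using the importance matrix
./llama-quantize --imatrix imatrix.dat input-model-f32.gguf output-model-q4_k_m.gguf q4_k_m
```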
You can also use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to build your own quants without any setup.
Note: It is synced from llama.cpp `main` every 6 hours.
Example usage:
```
./llama-quantize [options] input-model-f32.gguf [output-model-quant.gguf] type [threads]
```
Once a model has been quantized, it can be run directly with `llama-cli`, for example:

```bash
# start inference on a quantized model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p "You are a helpful assistant"
```
Options:
* `--allow-requantize` allows requantizing tensors that have already been quantized. Warning: this can severely reduce quality compared to quantizing from 16-bit or 32-bit
* `--leave-output-tensor` will leave `output.weight` un(re)quantized. Increases model size but may also increase quality, especially when requantizing
* `--pure` disables k-quant mixtures and quantizes all tensors to the same type
* `--imatrix` uses the data in the given file, generated by `llama-imatrix`, as an importance matrix for quantization optimizations (highly recommended)
* `--include-weights` use the importance matrix only for the tensor(s) in the list. Cannot be used with `--exclude-weights`
* `--exclude-weights` do not use the importance matrix for the tensor(s) in the list. Cannot be used with `--include-weights`
* `--output-tensor-type` use a specific quant type for the `output.weight` tensor
* `--token-embedding-type` use a specific quant type for the token embeddings tensor
* `--keep-split` will generate the quantized model in the same shards as the input file; otherwise a single quantized file is produced
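These options can be combined in a single invocation. A minimal sketch, with placeholder file names, could look like this:

```bash
# quantize to Q4_K_M with an importance matrix, keeping the output and token embedding tensors at Q8_0
# (file names are illustrative)
./llama-quantize --imatrix imatrix.dat --output-tensor-type q8_0 --token-embedding-type q8_0 input-model-f32.gguf output-model-q4_k_m.gguf q4_k_m 8
```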
Advanced options:
* `--tensor-type` quantize specific tensor(s) to specific quant types. Supports regex syntax. May be specified multiple times (see the example below)
* `--prune-layers` prune (remove) the layers in the list
* `--override-kv` override model metadata by key in the quantized model. May be specified multiple times
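For instance, `--tensor-type` takes `tensor=type` pairs; the tensor-name patterns and types below are only an illustration of the syntax:

```bash
# quantize the attention value and output projection tensors to Q8_0 while the rest use the default Q4_K_M mix
# (tensor name patterns and file names are illustrative)
./llama-quantize --tensor-type attn_v=q8_0 --tensor-type attn_output=q8_0 input-model-f32.gguf output-model-q4_k_m.gguf q4_k_m
```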
Examples:
```bash
# naive Q4_K_M quantization using default settings and 8 CPU threads. Output will be "ggml-model-Q4_K_M.gguf"
./llama-quantize input-model-f32.gguf q4_k_m 8
```
```bash
# quantize the model enabling re-quantization, leaving the output tensor unquantized and all others quantized at the same level (Q4_K)
./llama-quantize --allow-requantize --leave-output-tensor --pure input-model-q8_0.gguf output-model-q4_k.gguf q4_k

# override the expert used count metadata to 16, prune layers 20, 21 and 22 without quantizing the model (copy tensors) and use the specified name for the output file
# (the metadata key prefix is model-specific; "qwen3moe" is used here only as an illustration)
./llama-quantize --override-kv qwen3moe.expert_used_count=int:16 --prune-layers 20,21,22 input-model-f32.gguf pruned-model-f32.gguf copy
```
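Since accuracy loss is typically judged by perplexity (see above), a quantized model can be sanity-checked against the original with the `llama-perplexity` tool; the evaluation file name below is a placeholder:

```bash
# compare perplexity of the original and the quantized model on the same evaluation text
# (file names are illustrative)
./llama-perplexity -m input-model-f32.gguf -f eval-text.txt
./llama-perplexity -m output-model-q4_k_m.gguf -f eval-text.txt
```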
When running the larger models, make sure you have enough disk space to store all the intermediate files.
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same. For example (Llama 3.1):
| Model | Original size | Quantized size (Q4_K_M) |