Commit 63499e1

Update README.md (#1237)
1 parent 3ad67a0 commit 63499e1

File tree

1 file changed: +2 -2 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@
 * 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): ⚡ `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes.
 `Triton` and `Torch` kernels support the full range of the new `QuantizeConfig.pack_dtype`.
 New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no risk of OOM.
-New `GPTQModel.push_to_hub()` api for easy quant model to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models.
-Fixed `bits=3` packing, and `group_size=-1` regression in v1.7.4.
+New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models.
+Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
 * 01/26/2025 [1.7.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4): New `compile()` api for ~4-8% inference tps improvement. Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
 * 01/20/2025 [1.7.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.3): New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied-embedding.
 * 01/19/2025 [1.7.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.2): Effective BPW (bits per weight) will now be logged during `load()`. Reduce loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version.
