
Commit c0f9dc0

update (#1370)
* update
  Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
* update
  Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
---------
Signed-off-by: Qubitium <Qubitium@modelcloud.ai>
1 parent 1c788b0 commit c0f9dc0

File tree

4 files changed: +15 −9 lines changed


README.md

Lines changed: 7 additions & 1 deletion
@@ -15,7 +15,13 @@
 </p>

 ## News
-* 2/22/2025 2.0.0-dev: 🎉 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, lower-quality but faster, backend. `ModelScope` support added. Logging and cli progress bar output has been revamped with sticky bottom progress. Fixed `generation_config.json` save and load. Fix Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate loggin/progressbar to [LogBar](https://github.com/modelcloud/logbar) pkg.
+* 03/03/2025 [2.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.0.0): 🎉 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion.
+Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, a lower-quality but faster backend.
+Added `ModelScope` support. Logging and CLI progress bar output have been revamped with a sticky bottom progress bar.
+Fixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression.
+Fixed Qwen 2.5 MoE regressions.
+Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegated logging/progress bar to the [LogBar](https://github.com/modelcloud/logbar) pkg.
+Fixed ROCm version auto detection in `setup` install.
 * 02/12/2025 [1.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.9.0): ⚡ Offload `tokenizer` fixes to [Toke(n)icer](https://github.com/modelcloud/tokenicer) pkg. Optimized `lm_head` quant time and vram usage.
 Optimized `DeepSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regression in `v1.8.1`. 3x speed-up for `Torch` kernel when using PyTorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like `wikitext2`.
 * 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): ⚡ `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes.
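
The 2.0.0 and 1.9.0 entries above mention the `MARLIN_FP16` backend and the `calibration_dataset_concat_size` quantization option. A minimal sketch of how these might be exercised, assuming the standard `GPTQModel` load/quantize/save flow; the model id, save path, and calibration text are placeholder assumptions, not taken from this commit:

```python
# Hedged sketch: exercising features named in the release notes above.
# Model id, save path, and calibration text are placeholder assumptions.
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig

calibration = ["GPTQModel quantizes large language models."] * 256  # placeholder calibration data

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)

# `calibration_dataset_concat_size` turns on the concat packing mode described in the 1.9.0 note.
model.quantize(calibration, calibration_dataset_concat_size=2048)
model.save("llama3.2-1b-gptq-4bit")

# Reload with the faster, lower-quality MARLIN_FP16 kernel added in 2.0.0.
model = GPTQModel.load("llama3.2-1b-gptq-4bit", backend=BACKEND.MARLIN_FP16)
```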

gptqmodel/utils/importer.py

Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@

 AUTO_SELECT_BACKEND_ORDER = OrderedDict({
     BACKEND.MARLIN: MarlinQuantLinear, # optimized for bs > 1
-    BACKEND.EXLLAMA_EORA: ExllamaEoraQuantLinear, #
+    # BACKEND.EXLLAMA_EORA: ExllamaEoraQuantLinear, #
     BACKEND.EXLLAMA_V2: ExllamaV2QuantLinear, # optimized for bs > 1
     BACKEND.EXLLAMA_V1: ExllamaQuantLinear, # optimized for bs == 1
     BACKEND.TRITON: TritonV2QuantLinear, # good all around kernel that JIT compiles
@@ -56,7 +56,7 @@
 })

 FORMAT_DICT = {
-    FORMAT.GPTQ: [BACKEND.MARLIN, BACKEND.EXLLAMA_V2, BACKEND.EXLLAMA_V1, BACKEND.EXLLAMA_EORA, BACKEND.TRITON, BACKEND.CUDA, BACKEND.IPEX, BACKEND.TORCH, BACKEND.MARLIN_FP16],
+    FORMAT.GPTQ: [BACKEND.MARLIN, BACKEND.EXLLAMA_V2, BACKEND.EXLLAMA_V1, BACKEND.TRITON, BACKEND.CUDA, BACKEND.IPEX, BACKEND.TORCH, BACKEND.MARLIN_FP16, BACKEND.EXLLAMA_EORA],
     FORMAT.GPTQ_V2: [BACKEND.EXLLAMA_V2, BACKEND.EXLLAMA_V1, BACKEND.TRITON, BACKEND.CUDA, BACKEND.TORCH],
     FORMAT.MARLIN: [BACKEND.MARLIN, BACKEND.MARLIN_FP16],
     FORMAT.BITBLAS: [BACKEND.BITBLAS],

gptqmodel/version.py

Lines changed: 1 addition & 1 deletion
@@ -14,4 +14,4 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-__version__ = "2.0.0-dev"
+__version__ = "2.0.0"

tests/test_lora.py

Lines changed: 5 additions & 5 deletions
@@ -29,7 +29,7 @@

 class Test(ModelTest):
     NATIVE_MODEL_ID = "/monster/data/model/sliuau-llama3.2-1b-4bit-group128"
-    lora_path = "/monster/data/model/sliuau-llama3.2-1b-4bit-group128/llama3.2-1b-4bit-group128-eora-rank128-arc/adapter_model.safetensors" #"sliuau/llama3.2-1b-4bit-group128-eora_test-rank128-arc/blob/main/adapter_model.safetensors" #"sliuau/llama3.2-1b-4bit-group128-eora_test-rank128-arc"
+    lora_path = "/monster/data/model/sliuau-llama3.2-1b-4bit-group128/llama3.2-1b-4bit-group128-eora-rank128-arc" #"sliuau/llama3.2-1b-4bit-group128-eora_test-rank128-arc/blob/main/adapter_model.safetensors" #"sliuau/llama3.2-1b-4bit-group128-eora_test-rank128-arc"

     NATIVE_ARC_CHALLENGE_ACC = 0.3567
     NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3805
@@ -45,8 +45,8 @@ def setUpClass(cls):
             # BACKEND.CUDA,
             # BACKEND.TRITON,
             # BACKEND.EXLLAMA_V1,
-            BACKEND.EXLLAMA_V2,
-            # BACKEND.MARLIN,
+            # BACKEND.EXLLAMA_V2,
+            BACKEND.MARLIN,
             # # (BACKEND.IPEX), <-- not tested yet
             # # (BACKEND.BITBLAS, <-- not tested yet
         ])
@@ -65,10 +65,10 @@ def test_load(self, backend: BACKEND):
         self.assertIn("paris", result.lower())

     @parameterized.expand([
-        BACKEND.EXLLAMA_V2,
+        BACKEND.MARLIN,
     ])
     def test_download(self, backend: BACKEND):
-        adapter = Lora(path="https://huggingface.co/sliuau/llama3.2-1b-4bit-group128-eora-rank128-arc/blob/main/adapter_model.safetensors", rank=128)
+        adapter = Lora(path="sliuau/llama3.2-1b-4bit-group128-eora-rank128-arc", rank=128)

         model = GPTQModel.load(
             self.NATIVE_MODEL_ID,
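
The test above now resolves the EoRA adapter from a Hub repo id rather than a direct `adapter_model.safetensors` URL, and exercises it through the `MARLIN` backend. A condensed sketch of the same flow outside the test harness; the `Lora` import path, the `adapter=` keyword, and the generation snippet are assumptions inferred from the truncated `GPTQModel.load(` call, not verbatim from this diff:

```python
# Hedged sketch mirroring tests/test_lora.py; import path and `adapter=` keyword are assumed.
from transformers import AutoTokenizer

from gptqmodel import BACKEND, GPTQModel
from gptqmodel.adapter.adapter import Lora  # assumed location of the Lora adapter config

# Hub repo id form introduced by this commit; the adapter weights are resolved at load time.
adapter = Lora(path="sliuau/llama3.2-1b-4bit-group128-eora-rank128-arc", rank=128)

model = GPTQModel.load(
    "sliuau/llama3.2-1b-4bit-group128",  # placeholder Hub id; the test loads a local path instead
    adapter=adapter,                     # assumed keyword for attaching the EoRA adapter
    backend=BACKEND.MARLIN,              # the backend the updated test parameterizes
)

tok = AutoTokenizer.from_pretrained("sliuau/llama3.2-1b-4bit-group128")
inputs = tok("Capital of France is", return_tensors="pt").to(model.device)  # device attribute assumed
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))  # the test asserts "paris" appears in the output
```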
