### Feature request
Add support for SINQ quantization of Hugging Face-compatible models, enabling users to apply it directly through the quantization configuration. The SINQ method, recently introduced in the paper SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights, has quickly gained significant attention. It demonstrates superior quantization quality compared to existing approaches such as HQQ, while also offering substantially faster quantization times.
### Motivation
Integrating the SINQ quantization algorithm into the Transformers library (as has already been done for HQQ, AWQ, HIGGS, ...) would allow users to quantize models simply by specifying the desired quantization method and parameters in the configuration, removing the need to consult and directly use custom code from the SINQ repository. This integration aims to streamline and simplify the quantization process while leveraging the existing features and infrastructure of the Transformers library.
### Your contribution
I’m going to submit a pull request that includes the implementation and tests for the SINQ quantization integration. It enables users to specify the quantization method directly through the configuration, as shown below:
```python
cfg = SinqConfig(
    nbits=4,
    group_size=64,
    tiling_mode="1D",
    method="sinq",
    dtype="auto",
    modules_to_not_convert=["lm_head"],
    device="cuda:1",
)
```

Once the configuration is defined, the model can be quantized simply by calling the `from_pretrained()` function with the specified configuration settings.
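For illustration, a minimal loading sketch under the proposed API: `SinqConfig` is the class this PR would add, the checkpoint name is only a placeholder, and passing the config via the `quantization_config` keyword follows the convention of the existing quantizers (HQQ, AWQ, ...) already supported by `from_pretrained()`.

```python
from transformers import AutoModelForCausalLM

# Assumes `cfg` is the SinqConfig defined above. As with other
# calibration-free quantizers (e.g. HQQ), the weights would be
# quantized on the fly while the model is loaded; the checkpoint
# name below is just an example.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=cfg,
)
```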