Integration of the SINQ quantization strategy #42116

@ChiaraBoretti

Feature request

Add support for SINQ quantization for Hugging Face-compatible models, enabling users to apply it directly through the configuration settings. The SINQ method, recently introduced in the paper "SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights", has quickly gained significant attention. It demonstrates superior effectiveness compared to existing approaches such as HQQ, while also offering substantially faster quantization times.

Motivation

Integrating the SINQ quantization algorithm into the Transformers library (as already done for HQQ, AWQ, HIGGS, ...) would allow users to quantize models simply by specifying the desired quantization method and parameters in the configuration, removing the need to consult and run custom code from the SINQ repository directly. This integration aims to streamline and simplify the quantization process while leveraging the existing features and infrastructure of the Transformers library.

Your contribution

I will submit a pull request with the implementation and tests for the SINQ quantization integration. The integration lets users specify the quantization method directly through the configuration, as shown below:

# SinqConfig would be exposed from transformers, alongside the other
# quantization configs (HqqConfig, AwqConfig, ...).
from transformers import SinqConfig

cfg = SinqConfig(
    nbits=4,                             # weight bit-width
    group_size=64,                       # quantization group size
    tiling_mode="1D",                    # weight tiling mode
    method="sinq",                       # SINQ method variant
    dtype="auto",
    modules_to_not_convert=["lm_head"],  # keep the LM head unquantized
    device="cuda:1"
)

Once the configuration is defined, the model can be quantized simply by calling the from_pretrained() function with the specified configuration settings.
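
A minimal sketch of the intended usage, assuming SINQ is wired into from_pretrained() through the standard quantization_config argument like the other supported backends; the model id is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"  # illustrative model id

# quantization_config is the standard entry point used by the other
# quantization backends (HQQ, AWQ, ...); SINQ is assumed to follow suit.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=cfg,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)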
