Welcome to the repository for the Quantization Fundamentals with Hugging Face course, brought to you by DeepLearning.AI in collaboration with Hugging Face. 🎉
This repo contains the notebooks and explanations used in the course to help you understand how to make Large Language Models (LLMs) faster, lighter, and more efficient through quantization.
Large deep learning models are powerful but often too big to run efficiently. Quantization offers a practical way to reduce memory, storage, and compute requirements while keeping model quality largely intact.
By the end of this course, you will:
- ✅ Understand how to work with big models
- ✅ Learn about data types and number representations in deep learning
- ✅ Explore how to load models efficiently with different data types
- ✅ Grasp the theory of quantization and why it works
- ✅ Apply quantization to LLMs for real-world usage
Large models create memory and compute challenges.
- Weights, activations, and optimizer state all contribute to memory usage.
- Techniques: gradient checkpointing, offloading to CPU/NVMe, model parallelism, ZeRO optimization.
- Precision reduction (FP16, INT8) significantly lowers memory.
- Example: a 7B-parameter model needs ~28GB for weights in FP32, ~14GB in FP16, and ~7GB in INT8 (see the estimate sketched below).
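The figures above follow directly from bytes per parameter. Here is a minimal back-of-the-envelope sketch; the helper name and the dictionary of byte sizes are illustrative, not part of the course materials:

```python
# Back-of-the-envelope weight memory at different precisions.
# Ignores activations, optimizer state, and framework overhead,
# which add substantially to the totals during training.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"7B model, {dtype:>9}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
```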
Quantization depends on how numbers are represented.
- FP32: 32-bit float, high precision, high memory.
- FP16 / BF16: 16-bit float formats, half the storage, BF16 has better stability.
- INT8 / INT4: integer formats; they require mapping floats to integers using a `scale` and a `zero_point` (sketched after this list).
- Per-tensor vs per-channel quantization: per-channel usually gives lower error for weights.
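As a rough illustration of why per-channel scales help, here is a short PyTorch sketch comparing per-tensor and per-channel symmetric INT8 quantization error; the toy weight matrix and function names are made up for this example:

```python
import torch

torch.manual_seed(0)
# Toy FP32 weight matrix: rows (output channels) with very different magnitudes.
w = torch.randn(4, 64) * torch.tensor([[0.1], [1.0], [5.0], [20.0]])

def symmetric_int8_quantize(x, dim=None):
    """Symmetric INT8 quantization (zero_point = 0).
    dim=None -> one scale for the whole tensor (per-tensor);
    dim=1    -> one scale per row / output channel (per-channel)."""
    if dim is None:
        max_abs = x.abs().max()
    else:
        max_abs = x.abs().amax(dim=dim, keepdim=True)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q.to(torch.int8), scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

for name, dim in [("per-tensor", None), ("per-channel", 1)]:
    q, scale = symmetric_int8_quantize(w, dim)
    err = (w - dequantize(q, scale)).abs().mean()
    print(f"{name:12s} mean abs error: {err:.5f}")
```

The per-channel variant reports a noticeably smaller error because the small-magnitude rows are no longer forced to share a scale with the largest row.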
Efficient model loading saves memory and computation.
- Load directly in FP16/BF16 by passing `torch_dtype` to Hugging Face `from_pretrained` (see the loading example after this list).
- Load in INT8/INT4 using quantization libraries such as bitsandbytes (`load_in_8bit=True`).
- Use mixed precision: keep sensitive layers such as LayerNorm in FP16 and quantize the rest.
- `device_map` can spread model layers across multiple GPUs or CPUs for large models.
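A minimal loading sketch, assuming `transformers`, `accelerate`, and `bitsandbytes` are installed and a GPU is available; the model id is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-350m"  # placeholder; any causal LM on the Hub works

# 1) Half-precision load: halves weight memory compared with FP32.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",   # spread layers across available GPUs/CPU
)

# 2) 8-bit load via bitsandbytes.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

print(model_fp16.get_memory_footprint() / 1e9, "GB in FP16")
print(model_int8.get_memory_footprint() / 1e9, "GB in INT8")
```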
Quantization approximates floating-point numbers with integers.
- Uniform affine quantization: `x ≈ scale * (q - zero_point)` (see the sketch after this list).
- Symmetric vs asymmetric: symmetric (`zero_point = 0`) works well for weights; asymmetric fits activations better.
- PTQ (Post-Training Quantization): quantize after training, fast but may lose accuracy.
- QAT (Quantization-Aware Training): simulate quantization during training to preserve accuracy.
- Calibration: Use a representative dataset to set activation ranges and avoid poor scaling.
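A small sketch of asymmetric (affine) INT8 quantization with a calibration step, using the formula above; the helper names and the fake calibration data are illustrative:

```python
import torch

def calibrate_affine_int8(samples):
    """Derive scale and zero_point from an observed activation range."""
    x_min = min(samples.min().item(), 0.0)   # make sure 0 is representable
    x_max = max(samples.max().item(), 0.0)
    qmin, qmax = -128, 127
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)

def dequantize(q, scale, zero_point):
    # x ≈ scale * (q - zero_point)
    return scale * (q.to(torch.float32) - zero_point)

# "Calibration" data: a representative batch of activations (ReLU-like, skewed positive).
calib = torch.relu(torch.randn(1000))
scale, zp = calibrate_affine_int8(calib)

x = torch.relu(torch.randn(8))
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
print("max abs error:", (x - x_hat).abs().max().item())
```

If the calibration batch is not representative (for example, it misses the true activation range), the scale is wrong and errors grow sharply, which is exactly the "poor scaling" the calibration bullet warns about.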
Applying quantization to transformer-based LLMs requires care.
- Quantize weights (INT8/INT4) → major memory savings.
- Keep sensitive layers in FP16/BF16 (LayerNorm, softmax, embeddings).
- Steps: baseline evaluation → choose precision → PTQ with calibration → evaluate → refine (per-channel, mixed precision) → deploy (a condensed version is sketched below).
- Pitfalls: bad calibration, quantizing sensitive ops, skipping evaluation.
- Results: INT8 weights cut memory ~4× and INT4 ~8× relative to FP32, with little accuracy loss when done carefully.
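A condensed version of that workflow, assuming a single GPU with `bitsandbytes` installed; the model id and sample texts are placeholders, and a real evaluation would use a proper benchmark rather than two sentences:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-350m"          # placeholder model
texts = ["Quantization trades precision for memory.",
         "Large language models are expensive to serve."]

tokenizer = AutoTokenizer.from_pretrained(model_id)

def mean_loss(model):
    """Average language-modeling loss over the sample texts (lower is better)."""
    losses = []
    for t in texts:
        inputs = tokenizer(t, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, labels=inputs["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# 1) Baseline in BF16.
baseline = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
print("BF16 loss:", mean_loss(baseline),
      "| memory:", baseline.get_memory_footprint() / 1e9, "GB")

# 2) 4-bit weights; bitsandbytes quantizes the Linear weights and leaves
#    sensitive modules such as LayerNorm in higher precision.
cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
quantized = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=cfg, device_map="auto")
print("INT4 loss:", mean_loss(quantized),
      "| memory:", quantized.get_memory_footprint() / 1e9, "GB")
```

Comparing the loss and memory numbers before and after quantization is the simplest form of the "evaluate → refine" loop described above.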
- 🌐 Hugging Face Documentation
- 📘 DeepLearning.AI Courses
- 📄 Bits and Bytes Library (Hugging Face)
- 🎥 YouTube: Hugging Face Tutorials
This course is created by DeepLearning.AI and Hugging Face, with contributions from the open-source community. 🚀
✨ Star this repo if you find it useful and keep exploring quantization! ✨