Commit fbeaf96

Update OLMoE model card (#39344)
* Update OLMoE model card
* Checks Test
* Add license and code
* Update docs/source/en/model_doc/olmoe.md
* Update olmoe.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 641aaed commit fbeaf96


docs/source/en/model_doc/olmoe.md

Lines changed: 71 additions & 9 deletions
@@ -14,27 +14,89 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# OLMoE
-
+<div style="float: right;">
 <div class="flex flex-wrap space-x-1">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
 <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
 </div>
+</div>
+
+# OLMoE
+
+[OLMoE](https://huggingface.co/papers/2409.02060) is a sparse Mixture-of-Experts (MoE) language model with 7B parameters, of which only 1B are used per input token. It has inference costs similar to dense models but trains ~3x faster. OLMoE uses fine-grained routing with 64 small experts in each layer and a dropless, token-based routing algorithm.
+
+You can find all the original OLMoE checkpoints under the [OLMoE](https://huggingface.co/collections/allenai/olmoe-november-2024-66cf678c047657a30c8cd3da) collection.
+
+> [!TIP]
+> This model was contributed by [Muennighoff](https://hf.co/Muennighoff).
+>
+> Click on the OLMoE models in the right sidebar for more examples of how to apply OLMoE to different language tasks.
+
+The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    task="text-generation",
+    model="allenai/OLMoE-1B-7B-0125",
+    torch_dtype=torch.float16,
+    device=0,
+)
+
+result = pipe("Dionysus is the god of")
+print(result)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
+
+inputs = tokenizer("Bitcoin is", return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+output = model.generate(**inputs, max_length=64)
+print(tokenizer.decode(output[0]))
+```
 
-## Overview
+## Quantization
 
-The OLMoE model was proposed in [OLMoE: Open Mixture-of-Experts Language Models](https://huggingface.co/papers/2409.02060) by Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi.
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bits.
 
-OLMoE is a series of **O**pen **L**anguage **Mo**dels using sparse **M**ixture-**o**f-**E**xperts designed to enable the science of language models. We release all code, checkpoints, logs, and details involved in training these models.
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 
-The abstract from the paper is the following:
+device = "cuda" if torch.cuda.is_available() else "cpu"
 
-*We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.*
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4"
+)
 
-This model was contributed by [Muennighoff](https://hf.co/Muennighoff).
-The original code can be found [here](https://github.com/allenai/OLMoE).
+model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
 
+inputs = tokenizer("Bitcoin is", return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+output = model.generate(**inputs, max_length=64)
+print(tokenizer.decode(output[0]))
+```
 
 ## OlmoeConfig
 
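The new introduction in the diff above describes fine-grained, dropless routing over 64 small experts, with only a fraction of the parameters active for any given token. The toy sketch below illustrates the general idea of token-level top-k expert routing. It is illustrative only and not the Transformers modeling code: the hidden sizes are made up for readability, only the 64-expert count comes from the model card text, and the top-k value of 8 is an assumption taken from the OLMoE paper.

```py
# Toy sketch of token-level top-k expert routing (illustrative only, not the OLMoE modeling code).
# hidden_size/intermediate_size are arbitrary; num_experts=64 matches the model card text and
# top_k=8 is an assumption based on the OLMoE paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    def __init__(self, hidden_size=128, intermediate_size=256, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # the router scores every token against every expert
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # many small expert MLPs instead of one large dense MLP
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states):
        batch, seq, hidden = hidden_states.shape
        tokens = hidden_states.reshape(-1, hidden)               # route each token independently
        probs = F.softmax(self.router(tokens), dim=-1)           # (num_tokens, num_experts)
        top_probs, top_experts = probs.topk(self.top_k, dim=-1)  # keep the k best experts per token
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = (top_experts == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                         # this expert received no tokens
            # weight each expert's output by its router probability and sum over the k experts
            out[token_idx] += top_probs[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape(batch, seq, hidden)

moe = ToySparseMoE()
print(moe(torch.randn(2, 5, 128)).shape)  # torch.Size([2, 5, 128])
```

Because every token always keeps its full top-k set of experts (there is no per-expert capacity limit in this sketch), no tokens are discarded, which is roughly what "dropless" routing refers to.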

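As a follow-up to the 4-bit example in the diff, the snippet below compares memory footprints before and after quantization. `get_memory_footprint()` is a standard Transformers model helper; the exact numbers depend on your hardware and bitsandbytes version, and loading both models at once needs enough memory for both, so treat this as a sanity check rather than a benchmark.

```py
# Rough sanity check: compare the memory footprint of the fp16 and 4-bit quantized checkpoints.
# Numbers vary by setup; loading both models at once requires memory for both.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "allenai/OLMoE-1B-7B-0924"

fp16_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
print(f"fp16 footprint:  {fp16_model.get_memory_footprint() / 1e9:.1f} GB")  # ~2 bytes per parameter

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
int4_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
print(f"4-bit footprint: {int4_model.get_memory_footprint() / 1e9:.1f} GB")  # ~0.5 bytes per parameter plus overhead
```

With roughly 7B total parameters this works out to on the order of 14 GB in fp16 versus roughly 4 GB in 4-bit. The full expert set has to sit in memory even though only about 1B parameters are active per token, which is why quantization pays off for a sparse MoE like this.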