Commit fbeaf96

Update OLMoE model card (#39344)
* Update OLMoE model card
* Checks Test
* Add license and code
* Update docs/source/en/model_doc/olmoe.md
* Update olmoe.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 641aaed commit fbeaf96


docs/source/en/model_doc/olmoe.md

Lines changed: 71 additions & 9 deletions
@@ -14,27 +14,89 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# OLMoE
-
+<div style="float: right;">
 <div class="flex flex-wrap space-x-1">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
 <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
 </div>
+</div>
+
+# OLMoE
+
+[OLMoE](https://huggingface.co/papers/2409.02060) is a sparse Mixture-of-Experts (MoE) language model with 7B parameters, of which only 1B are used per input token. It has inference costs similar to dense models but trains ~3x faster. OLMoE uses fine-grained routing with 64 small experts in each layer and a dropless, token-based routing algorithm.
+
+You can find all the original OLMoE checkpoints under the [OLMoE](https://huggingface.co/collections/allenai/olmoe-november-2024-66cf678c047657a30c8cd3da) collection.
+
+> [!TIP]
+> This model was contributed by [Muennighoff](https://hf.co/Muennighoff).
+>
+> Click on the OLMoE models in the right sidebar for more examples of how to apply OLMoE to different language tasks.
+
+The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    task="text-generation",
+    model="allenai/OLMoE-1B-7B-0125",
+    torch_dtype=torch.float16,
+    device=0,
+)
+
+result = pipe("Dionysus is the god of")
+print(result)
+```
+
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
+
+inputs = tokenizer("Bitcoin is", return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+output = model.generate(**inputs, max_length=64)
+print(tokenizer.decode(output[0]))
+```
 
-## Overview
+## Quantization
 
-The OLMoE model was proposed in [OLMoE: Open Mixture-of-Experts Language Models](https://huggingface.co/papers/2409.02060) by Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi.
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bits.
 
-OLMoE is a series of **O**pen **L**anguage **Mo**dels using sparse **M**ixture-**o**f-**E**xperts designed to enable the science of language models. We release all code, checkpoints, logs, and details involved in training these models.
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 
-The abstract from the paper is the following:
+device = "cuda" if torch.cuda.is_available() else "cpu"
 
-*We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.*
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4"
+)
 
-This model was contributed by [Muennighoff](https://hf.co/Muennighoff).
-The original code can be found [here](https://github.com/allenai/OLMoE).
+model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", attn_implementation="sdpa", torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
+tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
 
+inputs = tokenizer("Bitcoin is", return_tensors="pt")
+inputs = {k: v.to(device) for k, v in inputs.items()}
+output = model.generate(**inputs, max_length=64)
+print(tokenizer.decode(output[0]))
+```
 
 ## OlmoeConfig
 
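The new introduction in the diff above describes fine-grained, dropless routing over 64 small experts, with only a fraction of the parameters active for any given token. The toy sketch below illustrates the general idea of token-level top-k expert routing. It is illustrative only and not the Transformers modeling code: the hidden sizes are made up for readability, only the 64-expert count comes from the model card text, and the top-k value of 8 is an assumption taken from the OLMoE paper.

```py
# Toy sketch of token-level top-k expert routing (illustrative only, not the OLMoE modeling code).
# hidden_size/intermediate_size are arbitrary; num_experts=64 matches the model card text and
# top_k=8 is an assumption based on the OLMoE paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    def __init__(self, hidden_size=128, intermediate_size=256, num_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # the router scores every token against every expert
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # many small expert MLPs instead of one large dense MLP
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states):
        batch, seq, hidden = hidden_states.shape
        tokens = hidden_states.reshape(-1, hidden)               # route each token independently
        probs = F.softmax(self.router(tokens), dim=-1)           # (num_tokens, num_experts)
        top_probs, top_experts = probs.topk(self.top_k, dim=-1)  # keep the k best experts per token
        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = (top_experts == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                         # this expert received no tokens
            # weight each expert's output by its router probability and sum over the k experts
            out[token_idx] += top_probs[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape(batch, seq, hidden)

moe = ToySparseMoE()
print(moe(torch.randn(2, 5, 128)).shape)  # torch.Size([2, 5, 128])
```

Because every token always keeps its full top-k set of experts (there is no per-expert capacity limit in this sketch), no tokens are discarded, which is roughly what "dropless" routing refers to.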

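As a follow-up to the 4-bit example in the diff, the snippet below compares memory footprints before and after quantization. `get_memory_footprint()` is a standard Transformers model helper; the exact numbers depend on your hardware and bitsandbytes version, and loading both models at once needs enough memory for both, so treat this as a sanity check rather than a benchmark.

```py
# Rough sanity check: compare the memory footprint of the fp16 and 4-bit quantized checkpoints.
# Numbers vary by setup; loading both models at once requires memory for both.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "allenai/OLMoE-1B-7B-0924"

fp16_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
print(f"fp16 footprint:  {fp16_model.get_memory_footprint() / 1e9:.1f} GB")  # ~2 bytes per parameter

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
int4_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
print(f"4-bit footprint: {int4_model.get_memory_footprint() / 1e9:.1f} GB")  # ~0.5 bytes per parameter plus overhead
```

With roughly 7B total parameters this works out to on the order of 14 GB in fp16 versus roughly 4 GB in 4-bit. The full expert set has to sit in memory even though only about 1B parameters are active per token, which is why quantization pays off for a sparse MoE like this.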