* Update OLMoE model card
* Checks Test
* Add license and code
* Update docs/source/en/model_doc/olmoe.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update olmoe.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
[OLMoE](https://huggingface.co/papers/2409.02060) is a sparse Mixture-of-Experts (MoE) language model with 7B total parameters, of which only 1B are used per input token. Its inference costs are similar to those of dense models, but it trains ~3x faster. OLMoE uses fine-grained routing with 64 small experts in each layer and a dropless token-based routing algorithm.

You can find all the original OLMoE checkpoints under the [OLMoE](https://huggingface.co/collections/allenai/olmoe-november-2024-66cf678c047657a30c8cd3da) collection.
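
The expert setup described above is visible directly in the model configuration. The snippet below is a minimal sketch (it assumes the checkpoint's config exposes the `num_experts` and `num_experts_per_tok` fields used by the OLMoE implementation in Transformers) for checking how many experts each layer holds and how many are routed to per token.

```py
from transformers import AutoConfig

# Load only the configuration, not the model weights
config = AutoConfig.from_pretrained("allenai/OLMoE-1B-7B-0125")

print(config.num_experts)          # total experts per MoE layer (64 per the description above)
print(config.num_experts_per_tok)  # experts activated for each token
```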
> [!TIP]
> This model was contributed by [Muennighoff](https://hf.co/Muennighoff). The original code can be found at [allenai/OLMoE](https://github.com/allenai/OLMoE).
>
> Click on the OLMoE models in the right sidebar for more examples of how to apply OLMoE to different language tasks.
The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="allenai/OLMoE-1B-7B-0125",
    torch_dtype=torch.float16,
    device=0,
)

result = pipe("Dionysus is the god of")
print(result)
```
</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0125")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0125", torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Dionysus is the god of", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0]))
```
</hfoption>
</hfoptions>

## Quantization
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    attn_implementation="sdpa",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```
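
To get a rough sense of what 4-bit loading saves, you can print the quantized model's memory footprint with [`~PreTrainedModel.get_memory_footprint`]. The snippet below is a small sketch that assumes the `model` object from the example above is still loaded.

```py
# Report the approximate memory used by the loaded (quantized) parameters, in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")
```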