Commit 641aaed

orionw, oweller2, and stevhliu authored
Update modernbertdecoder docs (#39453)
* update docs with paper and real model
* nit
* Apply suggestions from code review
  Thanks to @stevhlui!
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Remove usage examples, add quantization

---------

Co-authored-by: oweller2 <oweller2@dsailogin.mgmt.ai.cluster>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 049a674 commit 641aaed

File tree

1 file changed (+51, -18 lines)


docs/source/en/model_doc/modernbert-decoder.md

Lines changed: 51 additions & 18 deletions
@@ -24,14 +24,18 @@ rendered properly in your Markdown viewer.
 
 # ModernBERT Decoder
 
-ModernBERT Decoder is the same architecture as [ModernBERT](https://huggingface.co/papers/2412.13663) but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
+ModernBERT Decoder has the same architecture as [ModernBERT](https://huggingface.co/papers/2412.13663) but it is trained from scratch with a causal language modeling objective from the [Ettin paper](https://huggingface.co/papers/2507.11412). This allows for using the same architecture to compare encoders and decoders. This model is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
 
-Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
+ModernBERT Decoder uses sliding window attention and rotary positional embeddings for efficiency and to handle longer sequences.
+
+You can find all the original ModernBERT Decoder checkpoints under the [jhu-clsp](https://huggingface.co/collections/jhu-clsp/encoders-vs-decoders-the-ettin-suite-686303e16142257eed8e6aeb) collection.
 
 > [!TIP]
+> This model was contributed by [orionw](https://huggingface.co/orionweller).
+>
 > Click on the ModernBERT Decoder models in the right sidebar for more examples of how to apply ModernBERT Decoder to different text generation tasks.
 
-The example below demonstrates how to use ModernBERT Decoder for text generation with [`Pipeline`], [`AutoModel`], and from the command line.
+The example below demonstrates how to use ModernBERT Decoder for text generation with [`Pipeline`], [`AutoModel`] (with and without quantization), and from the command line.
 
 <hfoptions id="usage">
 <hfoption id="Pipeline">
@@ -42,7 +46,7 @@ from transformers import pipeline
 
 generator = pipeline(
     task="text-generation",
-    model="blab-jhu/test-32m-dec",
+    model="jhu-clsp/ettin-decoder-17m",
     torch_dtype=torch.float16,
     device=0
 )
@@ -51,7 +55,7 @@ generator("The future of artificial intelligence is", max_length=50, num_return_
 # For sequence classification
 classifier = pipeline(
     task="text-classification",
-    model="blab-jhu/test-32m-dec",
+    model="jhu-clsp/ettin-decoder-17m",
     torch_dtype=torch.float16,
     device=0
 )
@@ -65,9 +69,9 @@ classifier("This movie is really great!")
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
+tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-17m")
 model = AutoModelForCausalLM.from_pretrained(
-    "blab-jhu/test-32m-dec",
+    "jhu-clsp/ettin-decoder-17m",
     torch_dtype=torch.float16,
     device_map="auto",
 )
@@ -92,7 +96,7 @@ print(f"Generated text: {generated_text}")
 from transformers import AutoModelForSequenceClassification
 
 classifier_model = AutoModelForSequenceClassification.from_pretrained(
-    "blab-jhu/test-32m-dec",
+    "jhu-clsp/ettin-decoder-17m",
     torch_dtype=torch.float16,
     device_map="auto",
     num_labels=2
@@ -111,15 +115,53 @@ print(f"Prediction probabilities: {predictions}")
 ```
 
 </hfoption>
+
+<hfoption id="AutoModel (w/quantization)">
+
+```
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(
+    load_in_8bit=True,
+)
+
+tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-1b")
+model = AutoModelForCausalLM.from_pretrained(
+    "jhu-clsp/ettin-decoder-1b",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+
+prompt = "The future of artificial intelligence is"
+inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_length=50,
+        num_return_sequences=1,
+        temperature=0.7,
+        do_sample=True,
+        pad_token_id=tokenizer.eos_token_id
+    )
+
+generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(f"Generated text: {generated_text}")
+```
+</hfoption>
+
 <hfoption id="transformers CLI">
 
 ```bash
-echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0
+echo "The future of artificial intelligence is" | transformers run --task text-generation --model jhu-clsp/ettin-decoder-17m --device 0
 ```
 
 </hfoption>
 </hfoptions>
 
+
 ## ModernBertDecoderConfig
 
 [[autodoc]] ModernBertDecoderConfig
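
The hunk above adds an 8-bit loading example with `BitsAndBytesConfig`. A minimal sketch of how that example could be adapted to 4-bit loading, assuming `bitsandbytes` is installed and a CUDA GPU is available; the 4-bit settings below are illustrative and not taken from the commit:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly halves memory again compared to the 8-bit config above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-1b")
model = AutoModelForCausalLM.from_pretrained(
    "jhu-clsp/ettin-decoder-1b",
    device_map="auto",
    quantization_config=quantization_config,
)

inputs = tokenizer("The future of artificial intelligence is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
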
@@ -142,14 +184,5 @@ echo "The future of artificial intelligence is" | transformers run --task text-g
 [[autodoc]] ModernBertDecoderForSequenceClassification
     - forward
 
-### Usage tips
-
-The ModernBertDecoder model can be fine-tuned for various text generation tasks using the HuggingFace Transformers library. It supports efficient inference with features like:
-
-- **Causal attention**: Ensures autoregressive generation by masking future tokens
-- **Sliding window attention**: Alternates between local and global attention patterns for efficiency
-- **Rotary positional embeddings**: Enables handling of longer sequences up to 8000 tokens
-- **FlashAttention support**: Optimized attention computation for faster training and inference
-
 </pt>
 </frameworkcontent>
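
One way to sanity-check the quantized path the commit documents is to compare memory footprints between the fp16 and 8-bit loads with `get_memory_footprint()`. A minimal sketch, assuming the same `jhu-clsp/ettin-decoder-1b` checkpoint, `bitsandbytes` installed, and enough GPU memory for the fp16 baseline:

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "jhu-clsp/ettin-decoder-1b"

# fp16 baseline, matching the unquantized AutoModel example
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16 footprint: {fp16_model.get_memory_footprint() / 1e6:.0f} MB")

# 8-bit load, matching the quantized example added in this commit
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
print(f"int8 footprint: {int8_model.get_memory_footprint() / 1e6:.0f} MB")
```
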
