
Commit cbcb8e6

cassiasamp and stevhliu authored
updated mistral3 model card (#39531)
* updated mistral3 model card (#1)
* updated mistral3 model card
* applying suggestions from code review
* made all changes to mistral3.md
* adding space between paragraphs in docs/source/en/model_doc/mistral3.md
* removing duplicate in mistral3.md
* adding 4 backticks to preserve formatting

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 601260f commit cbcb8e6

File tree

1 file changed: +177 -173 lines changed


docs/source/en/model_doc/mistral3.md

Lines changed: 177 additions & 173 deletions
@@ -13,116 +13,125 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.

 -->
+<div style="float: right;">
+<div class="flex flex-wrap space-x-1">
+<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+</div>
+</div>

-# Mistral3
+# Mistral 3

-## Overview
+[Mistral 3](https://mistral.ai/news/mistral-small-3) is a latency-optimized model with far fewer layers, which reduces the time per forward pass. It adds vision understanding and supports long context lengths of up to 128K tokens without compromising performance.

-Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
+You can find the original Mistral 3 checkpoints under the [Mistral AI](https://huggingface.co/mistralai/models?search=mistral-small-3) organization.

-It is ideal for:
-- Fast-response conversational agents.
-- Low-latency function calling.
-- Subject matter experts via fine-tuning.
-- Local inference for hobbyists and organizations handling sensitive data.
-- Programming and math reasoning.
-- Long document understanding.
-- Visual understanding.

-This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez) and [yonigozlan](https://huggingface.co/yonigozlan).
+> [!TIP]
+> This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez) and [yonigozlan](https://huggingface.co/yonigozlan).
+> Click on the Mistral3 models in the right sidebar for more examples of how to apply Mistral3 to different tasks.

-The original code can be found [here](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/pixtral.py) and [here](https://github.com/mistralai/mistral-common).
+The example below demonstrates how to generate text for an image with [`Pipeline`] and the [`AutoModel`] class.

-## Usage example
+<hfoptions id="usage">
+<hfoption id="Pipeline">

-### Inference with Pipeline
+```py
+import torch
+from transformers import pipeline

-Here is how you can use the `image-text-to-text` pipeline to perform inference with the `Mistral3` models in just a few lines of code:
-```python
->>> from transformers import pipeline
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
+            {"type": "text", "text": "Describe this image."},
+        ],
+    },
+]

->>> messages = [
-... {
-... "role": "user",
-... "content": [
-... {
-... "type": "image",
-... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
-... },
-... {"type": "text", "text": "Describe this image."},
-... ],
-... },
-... ]
+pipeline = pipeline(
+    task="image-text-to-text",
+    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
+    torch_dtype=torch.bfloat16,
+    device=0
+)
+outputs = pipeline(text=messages, max_new_tokens=50, return_full_text=False)

->>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
->>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
->>> outputs[0]["generated_text"]
+outputs[0]["generated_text"]
 'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
 ```
-### Inference on a single image
-
-This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.
-
-```python
->>> from transformers import AutoProcessor, AutoModelForImageTextToText
->>> import torch
-
->>> torch_device = "cuda"
->>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
->>> processor = AutoProcessor.from_pretrained(model_checkpoint)
->>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
-
->>> messages = [
-... {
-... "role": "user",
-... "content": [
-... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
-... {"type": "text", "text": "Describe this image"},
-... ],
-... }
-... ]
-
->>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
-
->>> generate_ids = model.generate(**inputs, max_new_tokens=20)
->>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
-
->>> decoded_output
-"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoProcessor, AutoModelForImageTextToText
+
+torch_device = "cuda"
+model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+processor = AutoProcessor.from_pretrained(model_checkpoint)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_checkpoint,
+    device_map=torch_device,
+    torch_dtype=torch.bfloat16
+)
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
+            {"type": "text", "text": "Describe this image."},
+        ],
+    },
+]
+
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True, return_dict=True,
+    return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+generate_ids = model.generate(**inputs, max_new_tokens=20)
+decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
+
+decoded_output
+'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
 ```
+</hfoption>
+</hfoptions>

-### Text-only generation
-This example shows how to generate text using the Mistral3 model without providing any image input.
+## Notes

+- Mistral 3 supports text-only generation.
+````py
+from transformers import AutoProcessor, AutoModelForImageTextToText
+import torch

-````python
->>> from transformers import AutoProcessor, AutoModelForImageTextToText
->>> import torch
+torch_device = "cuda"
+model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+processor = AutoProcessor.from_pretrained(model_checkpoint)
+model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

->>> torch_device = "cuda"
->>> model_checkpoint = ".mistralai/Mistral-Small-3.1-24B-Instruct-2503"
->>> processor = AutoProcessor.from_pretrained(model_checkpoint)
->>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
+user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

->>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
->>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
+messages = [
+    {"role": "system", "content": SYSTEM_PROMPT},
+    {"role": "user", "content": user_prompt},
+]

->>> messages = [
-... {"role": "system", "content": SYSTEM_PROMPT},
-... {"role": "user", "content": user_prompt},
-... ]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
+generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
+decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

->>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
->>> inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
->>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
->>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
-
->>> print(decoded_output)
+print(decoded_output)
 "1. À plus tard!
-2. Salut, à plus!
-3. À toute!
-4. À la prochaine!
-5. Je me casse, à plus!
+2. Salut, à plus!
+3. À toute!
+4. À la prochaine!
+5. Je me casse, à plus!

 ```
 /\_/\
@@ -131,98 +140,93 @@ This example shows how to generate text using the Mistral3 model without providing any image input.
 ```"
 ````

-### Batched image and text inputs
-Mistral3 models also support batched image and text inputs.
-
-```python
->>> from transformers import AutoProcessor, AutoModelForImageTextToText
->>> import torch
-
->>> torch_device = "cuda"
->>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
->>> processor = AutoProcessor.from_pretrained(model_checkpoint)
->>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
-
->>> messages = [
-... [
-... {
-... "role": "user",
-... "content": [
-... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
-... {"type": "text", "text": "Write a haiku for this image"},
-... ],
-... },
-... ],
-... [
-... {
-... "role": "user",
-... "content": [
-... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
-... {"type": "text", "text": "Describe this image"},
-... ],
-... },
-... ],
-... ]
-
-
->>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
-
->>> output = model.generate(**inputs, max_new_tokens=25)
-
->>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
->>> decoded_outputs
+- Mistral 3 accepts batched image and text inputs.
+```py
+from transformers import AutoProcessor, AutoModelForImageTextToText
+import torch
+
+torch_device = "cuda"
+model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+processor = AutoProcessor.from_pretrained(model_checkpoint)
+model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+messages = [
+    [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
+                {"type": "text", "text": "Write a haiku for this image"},
+            ],
+        },
+    ],
+    [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
+                {"type": "text", "text": "Describe this image"},
+            ],
+        },
+    ],
+]
+
+
+inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+output = model.generate(**inputs, max_new_tokens=25)
+
+decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
+decoded_outputs
 ["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
 , "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
 ```

-### Batched multi-image input and quantization with BitsAndBytes
-This implementation of the Mistral3 models supports batched text-images inputs with different number of images for each text.
-This example also how to use `BitsAndBytes` to load the model in 4bit quantization.
-
-```python
->>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
->>> import torch
-
->>> torch_device = "cuda"
->>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
->>> processor = AutoProcessor.from_pretrained(model_checkpoint)
->>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
->>> model = AutoModelForImageTextToText.from_pretrained(
-... model_checkpoint, quantization_config=quantization_config
-... )
-
->>> messages = [
-...     [
-...         {
-...             "role": "user",
-...             "content": [
-...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
-...                 {"type": "text", "text": "Write a haiku for this image"},
-...             ],
-...         },
-...     ],
-...     [
-...         {
-...             "role": "user",
-...             "content": [
-...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
-...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
-...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
-...             ],
-...         },
-...     ],
->>> ]
-
->>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
-
->>> output = model.generate(**inputs, max_new_tokens=25)
-
->>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
->>> decoded_outputs
+- Mistral 3 also supports batched image and text inputs with a different number of images for each text. The example below quantizes the model with bitsandbytes.
+
+```py
+from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
+import torch
+
+torch_device = "cuda"
+model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+processor = AutoProcessor.from_pretrained(model_checkpoint)
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_checkpoint, quantization_config=quantization_config
+)
+
+messages = [
+    [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
+                {"type": "text", "text": "Write a haiku for this image"},
+            ],
+        },
+    ],
+    [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
+                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
+                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
+            ],
+        },
+    ],
+]
+
+inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+output = model.generate(**inputs, max_new_tokens=25)
+
+decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
+decoded_outputs
 ["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
 ```

-
 ## Mistral3Config

 [[autodoc]] Mistral3Config
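The updated card closes by auto-documenting `Mistral3Config`. As a quick, hypothetical sketch of what that configuration exposes (not part of the commit above; it assumes the `mistralai/Mistral-Small-3.1-24B-Instruct-2503` checkpoint used throughout the card, Hub access, and a transformers release recent enough to include the Mistral3 classes):

```py
from transformers import Mistral3Config

# Load the composite configuration from the checkpoint referenced in the card
# (requires access to the Hugging Face Hub).
config = Mistral3Config.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")

# Mistral 3 is a vision-language model, so the config nests a text config and a vision config.
print(config.text_config.num_hidden_layers)        # language-model depth
print(config.text_config.max_position_embeddings)  # context window advertised in the card
print(config.vision_config.image_size)             # input resolution expected by the vision encoder
```

Because the model pairs a Mistral language model with a separate vision encoder, the composite config is usually inspected through its nested `text_config` and `vision_config` attributes rather than through flattened top-level fields.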