142 changes: 140 additions & 2 deletions docs/source/en/model_doc/qwen3.md
@@ -23,11 +23,149 @@ rendered properly in your Markdown viewer.

### Model Details

To be released with the official model launch.
Qwen3-32B is the dense 32B-parameter variant in the Qwen3 series. Key architectural improvements include:

- **Extended Context Length**: Supports a context window of up to 128K tokens
- **Enhanced Architecture**: Uses GQA (Grouped Query Attention), in which several query heads share one key/value head, reducing KV-cache memory
- **Dual Attention Mechanism**: Similar to Qwen2.5, alternates between local sliding window attention and global attention layers
- **Improved Training**: Post-trained with RLHF and advanced instruction tuning
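
To make the GQA bullet concrete, the sketch below shows how query heads can be grouped onto shared key/value heads and how that changes KV-cache size. The head counts, layer count, and head dimension are illustrative assumptions, not values read from the Qwen3-32B configuration, and the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by the given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shapes, chosen only for illustration.
NUM_Q_HEADS, NUM_KV_HEADS, NUM_LAYERS, HEAD_DIM = 64, 8, 64, 128

mapping = [kv_head_for_query_head(q, NUM_Q_HEADS, NUM_KV_HEADS) for q in range(NUM_Q_HEADS)]
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM, seq_len=32_768)
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len=32_768)

print(f"query heads 0-7 use KV head {mapping[0]}, query heads 8-15 use KV head {mapping[8]}")
print(f"KV cache is {mha_bytes // gqa_bytes}x smaller than with one KV head per query head")
```

With these assumed shapes the cache shrinks by the ratio of query heads to key/value heads; the real ratio depends on the values in the model's config.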

Qwen3-32B is available in both base (`Qwen/Qwen3-32B`) and instruction-tuned (`Qwen/Qwen3-32B-Instruct`) variants.

## Usage tips

To be released with the official model launch.
### Basic Text Generation

Here's how to use Qwen3 for text generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")

# Basic generation
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.8,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
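
The `top_p=0.8` argument above enables nucleus sampling: only the most likely tokens whose cumulative probability reaches `top_p` remain candidates. The following is a minimal pure-Python sketch of that idea, not the `transformers` implementation; the `nucleus` helper and the toy probabilities are hypothetical.

```python
def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution, for illustration only.
toy_probs = {"qubits": 0.50, "bits": 0.25, "atoms": 0.15, "bananas": 0.10}
filtered = nucleus(toy_probs, top_p=0.8)
print(sorted(filtered))
```

Here the least likely token is dropped before sampling, which is why a lower `top_p` tends to give more conservative output.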

### Chat Format

For the instruction-tuned variant, use the chat template for multi-turn conversations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What are the main differences between Python and JavaScript?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
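
The example above sends a single user turn. For a multi-turn conversation, each model reply is appended to the message list before the chat template is applied again. A minimal sketch of that bookkeeping follows; `add_turn` is a hypothetical helper and the reply text is a placeholder rather than real model output.

```python
def add_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history with one more message; the input list is not modified."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"unexpected role: {role!r}")
    return messages + [{"role": role, "content": content}]


history = [{"role": "system", "content": "You are a helpful AI assistant."}]
history = add_turn(history, "user", "What is a list comprehension?")

# In real use, `reply` would be the decoded output of the previous generate() call.
reply = "<model reply goes here>"
history = add_turn(history, "assistant", reply)
history = add_turn(history, "user", "Show a one-line example.")

roles = [message["role"] for message in history]
print(roles)
```

The resulting `history` can then be passed to `tokenizer.apply_chat_template(..., add_generation_prompt=True)` in place of `messages`.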

### Memory Optimization

For systems with limited GPU memory, use quantization:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit quantization configuration (requires the `bitsandbytes` package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")

# Use as normal - memory footprint is significantly reduced
```
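
As a rough back-of-the-envelope check on what 4-bit loading saves, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 32 billion parameters and ignores activations, the KV cache, quantization metadata, and any layers left unquantized, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 32e9  # nominal "32B"; the exact parameter count may differ

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)
print(f"bf16 weights: ~{bf16_gib:.1f} GiB, 4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint.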

### Long Context Usage

Qwen3 supports up to 128K tokens. For long documents:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Recommended for long contexts; requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")

# Process long document
long_document = "..." * 10000 # Your long text here
prompt = f"Summarize the following document:\n\n{long_document}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=False).to(model.device)

# Note: Ensure your input doesn't exceed 128K tokens
print(f"Input tokens: {inputs.input_ids.shape[1]}")

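outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens so the summary excludes the long prompt
summary = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(summary)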
```
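
The note in the example above about staying under 128K tokens can be turned into an explicit check. This sketch assumes "128K" means 131,072 tokens and that the prompt and the generated tokens share the same window; both helper names are hypothetical.

```python
CONTEXT_WINDOW = 128 * 1024  # assumption: "128K" is read as 131,072 tokens


def fits_in_context(num_input_tokens: int, max_new_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the requested new tokens fit in the window."""
    return num_input_tokens + max_new_tokens <= context_window


def max_input_tokens(max_new_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    return max(context_window - max_new_tokens, 0)


print(fits_in_context(num_input_tokens=120_000, max_new_tokens=500))
print(fits_in_context(num_input_tokens=131_000, max_new_tokens=500))
print(max_input_tokens(max_new_tokens=500))
```

In the example above, `num_input_tokens` would be `inputs.input_ids.shape[1]`, checked before calling `model.generate`.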

### Tips for Best Performance

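- **Use `torch.bfloat16`**: Halves weight memory relative to float32, typically with little loss in output quality
- **Enable Flash Attention 2**: Significantly faster for long contexts (`attn_implementation="flash_attention_2"`); requires the `flash-attn` package and a supported GPU
- **Batch Processing**: Process multiple inputs together for better throughput; for decoder-only generation, pad on the left (`padding_side="left"`)
- **Temperature Tuning**: Lower (0.1-0.5) for factual tasks, higher (0.7-1.0) for creative tasks; temperature only takes effect when `do_sample=True`
- **System Prompt**: Use a clear system prompt with the instruction-tuned variant to guide behavior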
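
To illustrate the temperature tip, the sketch below applies a temperature-scaled softmax to a toy set of logits. It is a pure-Python illustration of the general technique, not code from `transformers`; the logits are made up.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # illustrative values only
sharp = softmax_with_temperature(toy_logits, temperature=0.3)
flat = softmax_with_temperature(toy_logits, temperature=1.0)

print("T=0.3:", [round(p, 3) for p in sharp])
print("T=1.0:", [round(p, 3) for p in flat])
```

A lower temperature concentrates probability on the top token, which suits factual tasks; a higher temperature spreads it out, which gives more varied output.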

## Qwen3Config
