From 7bf59ea4025c8987748875073642f02d84451f17 Mon Sep 17 00:00:00 2001
From: Gorka Bengochea
Date: Mon, 10 Nov 2025 11:52:04 +0100
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9A=20docs(qwen3):=20add=20comprehensi?=
 =?UTF-8?q?ve=20usage=20examples=20and=20model=20details?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaced placeholder text with detailed documentation including:

- Model architecture details and key features
- Basic text generation example
- Chat format usage with multi-turn conversations
- Memory optimization with quantization
- Long context (128K tokens) usage example
- Performance tips and best practices

This provides users with practical, ready-to-use examples for all
common Qwen3-32B use cases, improving the developer experience for
this model.
---
 docs/source/en/model_doc/qwen3.md | 142 +++++++++++++++++++++++++++++-
 1 file changed, 140 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/model_doc/qwen3.md b/docs/source/en/model_doc/qwen3.md
index 0141388fb97f..f80bdf2e8bcc 100644
--- a/docs/source/en/model_doc/qwen3.md
+++ b/docs/source/en/model_doc/qwen3.md
@@ -23,11 +23,149 @@ rendered properly in your Markdown viewer.
 
 ### Model Details
 
-To be released with the official model launch.
+Qwen3-32B is the dense 32B-parameter variant in the Qwen3 series. Key architectural improvements include:
+
+- **Extended Context Length**: Supports a context window of up to 128K tokens
+- **Enhanced Architecture**: Uses grouped-query attention (GQA) for improved efficiency
+- **Dual Attention Mechanism**: Similar to Qwen2.5, alternates between local sliding-window attention and global attention layers
+- **Improved Training**: Post-trained with RLHF and advanced instruction tuning
+
+Qwen3-32B is available in both base (`Qwen/Qwen3-32B`) and instruction-tuned (`Qwen/Qwen3-32B-Instruct`) variants.
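+
+You can check these architectural properties for a specific checkpoint by inspecting its configuration. The snippet below is a minimal sketch; it assumes the checkpoint name above and the standard attention and context-length attributes exposed by the config:
+
+```python
+from transformers import AutoConfig
+
+# Load only the configuration (no model weights are downloaded)
+config = AutoConfig.from_pretrained("Qwen/Qwen3-32B")
+
+# Fewer key/value heads than attention heads indicates grouped-query attention (GQA)
+print(config.num_attention_heads, config.num_key_value_heads)
+
+# Maximum context length the checkpoint is configured for
+print(config.max_position_embeddings)
+```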
 
 ## Usage tips
 
-To be released with the official model launch.
+### Basic Text Generation
+
+Here's how to use Qwen3 for text generation:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")
+
+# Basic generation
+prompt = "Explain quantum computing in simple terms:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=256,
+    temperature=0.7,
+    top_p=0.8,
+    do_sample=True
+)
+
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+
+### Chat Format
+
+For the instruction-tuned variant, use the chat template for multi-turn conversations:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")
+
+messages = [
+    {"role": "system", "content": "You are a helpful AI assistant."},
+    {"role": "user", "content": "What are the main differences between Python and JavaScript?"}
+]
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    temperature=0.7,
+    do_sample=True
+)
+
+# Decode only the newly generated tokens, skipping the prompt
+response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+print(response)
+```
+
+### Memory Optimization
+
+For systems with limited GPU memory, use quantization:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+
+# 4-bit quantization configuration
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4"
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B-Instruct",
+    quantization_config=quantization_config,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")
+
+# Use as normal - the memory footprint is significantly reduced
+```
+
+### Long Context Usage
+
+Qwen3 supports up to 128K tokens. For long documents:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="flash_attention_2"  # Recommended for long contexts
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct")
+
+# Process a long document
+long_document = "..." * 10000  # Your long text here
+prompt = f"Summarize the following document:\n\n{long_document}"
+
+inputs = tokenizer(prompt, return_tensors="pt", truncation=False).to(model.device)
+
+# Note: Ensure your input doesn't exceed 128K tokens
+print(f"Input tokens: {inputs.input_ids.shape[1]}")
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(summary)
+```
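+
+### Batched Generation
+
+For throughput-oriented workloads you can generate for several prompts in a single call (see the Batch Processing tip below). The sketch that follows is illustrative: the prompts are placeholders, and it assumes the same checkpoint as the examples above together with left padding, which decoder-only models need for batched generation:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+# Left padding keeps every prompt flush against the generated tokens
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B-Instruct", padding_side="left")
+if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS as the padding token
+
+prompts = [
+    "Write a haiku about the ocean.",
+    "Explain what a hash map is in one sentence.",
+]
+
+inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
+
+outputs = model.generate(**inputs, max_new_tokens=128)
+for output in outputs:
+    print(tokenizer.decode(output, skip_special_tokens=True))
+```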
+
+### Tips for Best Performance
+
+- **Use `torch.bfloat16`**: Provides the best balance of speed and quality
+- **Enable Flash Attention 2**: Significantly faster for long contexts (`attn_implementation="flash_attention_2"`)
+- **Batch Processing**: Process multiple inputs together for better throughput, as in the batched generation example above
+- **Temperature Tuning**: Use lower values (0.1-0.5) for factual tasks and higher values (0.7-1.0) for creative tasks
+- **System Prompt**: Use a clear system prompt with the instruction-tuned variant to guide its behavior
 
 ## Qwen3Config