Your current environment
The output of `python collect_env.py`:
vllm: 0.11.0
🐛 Describe the bug
When vLLM receives a structured output request (e.g. `json_object: true`) and the speculative decoder proposes a token that does not fit the required structure, vLLM crashes.
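For context, the behavior I would expect instead of a crash can be sketched as follows. This is a hypothetical illustration, not vLLM's actual code: the names `accept_draft_tokens` and `grammar_accepts` are made up, and a toy token set stands in for a real JSON grammar. The idea is that the verifier truncates the draft at the first token the grammar rejects rather than raising:

```python
def accept_draft_tokens(draft_tokens, grammar_accepts):
    """Return the longest prefix of draft_tokens that the grammar accepts.

    Instead of crashing when a speculative token violates the structure,
    drop that token and everything after it; decoding then continues
    from the last valid position.
    """
    accepted = []
    for tok in draft_tokens:
        if not grammar_accepts(tok):
            break  # reject this draft token and all later ones
        accepted.append(tok)
    return accepted


# Toy "grammar": only these tokens are structurally valid.
valid = {"{", '"key"', ":", '"value"', "}"}
draft = ["{", '"key"', ":", "hello", "}"]  # "hello" violates the grammar

print(accept_draft_tokens(draft, valid.__contains__))
# → ['{', '"key"', ':']
```

With handling like this, an invalid speculative token costs only the rejected suffix of the draft; the request itself should still complete.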
Server:
vllm serve meta-llama/Llama-3.1-8B-Instruct -tp 8 --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 4, "max_model_len": 2048}'
Client:
from concurrent.futures import ThreadPoolExecutor, wait

import requests

URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer EMPTY",
    "Content-Type": "application/json",
}
PAYLOAD = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [{"type": "text", "text": "Say hello world 500 times"}]},
    ],
    "structured_outputs": {"json_object": True},
}
CONCURR = 30

def do_request():
    return requests.post(URL, headers=HEADERS, json=PAYLOAD)

with ThreadPoolExecutor(max_workers=CONCURR) as executor:
    while True:
        futures = [executor.submit(do_request) for _ in range(CONCURR)]
        wait(futures)