Fix: tool call streaming when both reasoning and tool parsers are enabled #28297 #28330
Conversation
Force-pushed from dde4fcb to e059702.
Code Review
This pull request introduces a fix for tool call streaming when both reasoning and tool parsers are active. The change adds logic to detect the start of a tool call and skip the reasoning phase. While the intention is correct, the implementation has a flaw in how it manages state during this transition. This could lead to the tool parser not receiving the complete text, causing issues. I've added a critical review comment with a suggested code change to address this by simplifying the logic and leveraging the existing state management code.
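For readers skimming the thread, here is a minimal sketch of the kind of early-exit described above (hypothetical and simplified; the actual diff lives in vllm/entrypoints/openai/serving_chat.py and may differ):

```python
# Simplified sketch, not the actual diff: once the tool parser's start token
# shows up in the accumulated text, end the reasoning phase so the tool
# parser sees the tool call from its first token. `reasoning_end_arr` tracks,
# per choice index i, whether reasoning has finished.
def maybe_skip_reasoning(tool_parser, current_text: str,
                         reasoning_end_arr: list[bool], i: int) -> None:
    if (
        tool_parser is not None
        and hasattr(tool_parser, "tool_call_start_token")
        and tool_parser.tool_call_start_token in current_text
    ):
        reasoning_end_arr[i] = True  # hand subsequent deltas to the tool parser
```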
💡 Codex Review (vllm/vllm/entrypoints/openai/serving_chat.py, lines 950 to 1024 in dde4fcb): The new early-exit detects …
Does this response-without-reasoning only happen for the tool choice auto case?

yes
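For context, "tool choice auto" means the default behavior when tools are passed without forcing a particular function. A minimal request on that path might look like this (a sketch; the model name and tool schema are borrowed from the test plan later in this PR):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# One illustrative tool, matching the schema used in the test script below.
tools = [{
    "type": "function",
    "function": {
        "name": "search_products_general",
        "description": "Search for products",
        "parameters": {"type": "object", "properties": {}},
    },
}]

# tool_choice="auto" (the default whenever tools are supplied) lets the model
# decide whether to answer in plain text or call a tool, so a response can
# open with a tool call and carry no reasoning content at all.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{"role": "user", "content": "pc hp 16gb ram"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)
```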
llsj14 left a comment:
I think this is a nice fix!
I'm wondering if I understood correctly. My understanding is:
- If the model produces reasoning content, it would be generated and should be handled before any tool call.
- Tool calls can appear without reasoning. Especially in Qwen3-VL with the Hermes tool parser, there are cases where the model outputs a tool call directly, even with the reasoning parser activated.
How about integrating your addition into the existing conditions like the following?
```python
if (
    not reasoning_end_arr[i]
    and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )
):
    ...
else:
    reasoning_end_arr[i] = True
```
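To make the branches of that condition concrete, a self-contained toy run (`ToyParser` is hypothetical and only stands in for a tool parser such as Hermes that exposes `tool_call_start_token`; only the boolean logic is taken from the suggestion above):

```python
# Hypothetical stand-in for a tool parser that exposes a start token.
class ToyParser:
    tool_call_start_token = "<tool_call>"

def still_reasoning(reasoning_ended: bool, tool_parser, current_text: str) -> bool:
    """True while deltas should still be routed to the reasoning parser."""
    return not reasoning_ended and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )

parser = ToyParser()
print(still_reasoning(False, parser, "Let me think about"))        # True: keep parsing reasoning
print(still_reasoning(False, parser, '<tool_call>{"name": "f"}'))  # False: tool call started, skip reasoning
print(still_reasoning(True, parser, "anything"))                   # False: reasoning already ended
```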
Commit message for the update:

Fix tool call streaming when both reasoning and tool parsers are enabled

- Add early detection of tool call tokens in streaming mode
- Skip reasoning phase when tool calls are present
- Fixes issue vllm-project#28297 where the hermes tool parser fails in streaming mode when used together with the qwen3 reasoning parser

Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>

Force-pushed from 04a5f02 to 2df2ff4.
already changed!
Would really like this merged before v0.11.1. Tool calling in Qwen is one of the things that lag behind SGLang recently. Currently, a workaround is to use the `deepseek_r1` reasoning parser instead of `qwen3`.
@baonudesifeizhai Have you tested whether the results are correct without using the `--reasoning-parser` argument?
Without using the `--reasoning-parser` argument it seems to work well....
Yes. Qwen3-VL-32B-Instruct does not support reasoning.
@baonudesifeizhai It seems that the sign-off is missing in one of your commits: https://github.com/vllm-project/vllm/pull/28330/checks?check_run_id=54885615468
@chaunceyjiang Otherwise, can we get this merged before v0.11.1? Using this for Qwen3-VL-235B-A22B-Thinking.
Can you paste the result without using the `--async-scheduling` flag?
Without using `--async-scheduling` it works fine....

```
=== Testing without streaming ===
```
Hi @baonudesifeizhai, what I meant is: without using …, I tested …

So I'm now quite confused: which issue does your PR actually resolve?

Could you describe your issue in detail? And could you also provide the vLLM command you used?
It hasn't solved anything right now.... so if he doesn't have another problem I will close this PR....
@chaunceyjiang The issue is that https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 and https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 had substantial problems in this precise situation with tool calling in clients such as Cline/Roo Code. When reasoning is enabled together with tool calling, the … The solution was to use the deepseek_r1 reasoning parser instead of qwen3, but there is potential for this to work properly with qwen3 as well.
What doesn't work:

```bash
pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server \
    --port 5000 --host 0.0.0.0 \
    --download-dir /workspace/.cache/huggingface/hub \
    --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
    --tensor-parallel-size 4 --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3 \
    --mm-processor-cache-gb 0 --mm-encoder-tp-mode data \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```

What works (identical except for the reasoning parser):

```bash
pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server \
    --port 5000 --host 0.0.0.0 \
    --download-dir /workspace/.cache/huggingface/hub \
    --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
    --tensor-parallel-size 4 --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1 \
    --mm-processor-cache-gb 0 --mm-encoder-tp-mode data \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```
Force-pushed from 540716d to fe3814f.
Fixes #28297
Purpose
Test Plan
```bash
pytest tests/reasoning/test_qwen3_reasoning_parser.py -v
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt.video 0 \
    --limit-mm-per-prompt.image 0 \
    --async-scheduling \
    --enable-log-outputs \
    --enable-log-requests
```
test_tool_call_streaming.py:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products_general",
            "description": "Search for products",
            "parameters": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "config": {"type": "object"},
                },
            },
        },
    }
]

messages = [
    {"role": "user", "content": "pc hp 16gb ram"},
]

# Streaming: tool call deltas should arrive via `delta.tool_calls`,
# not as raw `<tool_call>` text inside `delta.content`.
print("=== Testing with streaming ===")
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(f"Chunk: {chunk.choices[0].delta.content}")
    if chunk.choices[0].delta.tool_calls:
        print(f"Tool calls chunk: {chunk.choices[0].delta.tool_calls}")

# Non-streaming: the same request should return parsed tool_calls.
print("\n=== Testing without streaming ===")
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=False,
)
print(f"Response: {response.choices[0].message}")
if response.choices[0].message.tool_calls:
    print(f"Tool calls: {response.choices[0].message.tool_calls}")
```
Test Result