Fix: tool call streaming when both reasoning and tool parsers are enabled #28297 #28330
Conversation
Force-pushed from dde4fcb to e059702.
Code Review
This pull request introduces a fix for tool call streaming when both reasoning and tool parsers are active. The change adds logic to detect the start of a tool call and skip the reasoning phase. While the intention is correct, the implementation has a flaw in how it manages state during this transition. This could lead to the tool parser not receiving the complete text, causing issues. I've added a critical review comment with a suggested code change to address this by simplifying the logic and leveraging the existing state management code.
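For readers skimming the thread, here is a minimal sketch of the kind of early-exit described above (hypothetical and simplified; the actual diff lives in vllm/entrypoints/openai/serving_chat.py and may differ):

```python
# Simplified sketch, not the actual diff: once the tool parser's start token
# shows up in the accumulated text, end the reasoning phase so the tool
# parser sees the tool call from its first token. `reasoning_end_arr` tracks,
# per choice index i, whether reasoning has finished.
def maybe_skip_reasoning(tool_parser, current_text: str,
                         reasoning_end_arr: list[bool], i: int) -> None:
    if (
        tool_parser is not None
        and hasattr(tool_parser, "tool_call_start_token")
        and tool_parser.tool_call_start_token in current_text
    ):
        reasoning_end_arr[i] = True  # hand subsequent deltas to the tool parser
```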
💡 Codex Review (vllm/vllm/entrypoints/openai/serving_chat.py, lines 950 to 1024 in dde4fcb): The new early-exit detects …
Does this response-without-reasoning only happen for the tool choice auto case?

yes
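For context, "tool choice auto" means the default behavior when tools are passed without forcing a particular function. A minimal request on that path might look like this (a sketch; the model name and tool schema are borrowed from the test plan later in this PR):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# One illustrative tool, matching the schema used in the test script below.
tools = [{
    "type": "function",
    "function": {
        "name": "search_products_general",
        "description": "Search for products",
        "parameters": {"type": "object", "properties": {}},
    },
}]

# tool_choice="auto" (the default whenever tools are supplied) lets the model
# decide whether to answer in plain text or call a tool, so a response can
# open with a tool call and carry no reasoning content at all.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{"role": "user", "content": "pc hp 16gb ram"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)
```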
llsj14 left a comment:
I think this is a nice fix!
I'm wondering if I understood correctly. My understanding is:
- If the model produces reasoning content, it would be generated and should be handled before any tool call.
- Tool calls can appear without reasoning. Especially in Qwen3-VL with the Hermes tool parser, there are cases where the model outputs a tool call directly, even with the reasoning parser activated.
How about integrating your addition into the existing conditions like the following?
```python
if (
    not reasoning_end_arr[i]
    and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )
):
    ...
else:
    reasoning_end_arr[i] = True
```
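To make the branches of that condition concrete, a self-contained toy run (`ToyParser` is hypothetical and only stands in for a tool parser such as Hermes that exposes `tool_call_start_token`; only the boolean logic is taken from the suggestion above):

```python
# Hypothetical stand-in for a tool parser that exposes a start token.
class ToyParser:
    tool_call_start_token = "<tool_call>"

def still_reasoning(reasoning_ended: bool, tool_parser, current_text: str) -> bool:
    """True while deltas should still be routed to the reasoning parser."""
    return not reasoning_ended and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )

parser = ToyParser()
print(still_reasoning(False, parser, "Let me think about"))        # True: keep parsing reasoning
print(still_reasoning(False, parser, '<tool_call>{"name": "f"}'))  # False: tool call started, skip reasoning
print(still_reasoning(True, parser, "anything"))                   # False: reasoning already ended
```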
Commit message for the update:

Fix tool call streaming when both reasoning and tool parsers are enabled

- Add early detection of tool call tokens in streaming mode
- Skip reasoning phase when tool calls are present
- Fixes issue vllm-project#28297 where the hermes tool parser fails in streaming mode when used together with the qwen3 reasoning parser

Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>

Force-pushed from 04a5f02 to 2df2ff4.
already changed!
Would really like this merged before v0.11.1. Tool calling in Qwen is one of the things that lag behind SGLang recently. Currently, a workaround is to use the `deepseek_r1` reasoning parser instead of `qwen3`.
@baonudesifeizhai Have you tested whether the results are correct without using the `--reasoning-parser` argument?
Without using the `--reasoning-parser` argument it seems to work well....
Yes. Qwen3-VL-32B-Instruct does not support reasoning.
@baonudesifeizhai It seems that the sign-off is missing in one of your commits: https://github.com/vllm-project/vllm/pull/28330/checks?check_run_id=54885615468
@chaunceyjiang Otherwise, can we get this merged before v0.11.1? Using this for Qwen3-VL-235B-A22B-Thinking.
Can you paste the result without using the `--async-scheduling` flag?
Without using `--async-scheduling` it works fine....

```
=== Testing without streaming ===
```
Hi @baonudesifeizhai, what I meant is: without using …, I tested …

So I'm now quite confused: which issue does your PR actually resolve?

Could you describe your issue in detail? And could you also provide the vLLM command you used?
It hasn't solved anything right now.... so if he doesn't have another problem I will close this PR....
@chaunceyjiang The issue is that https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 and https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 had substantial problems in this precise situation with tool calling in clients such as Cline/Roo Code. When reasoning is enabled together with tool calling, the … The solution was to use the deepseek_r1 reasoning parser instead of qwen3, but there is potential for this to work properly with qwen3 as well.
What doesn't work:

```bash
pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server \
    --port 5000 --host 0.0.0.0 \
    --download-dir /workspace/.cache/huggingface/hub \
    --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
    --tensor-parallel-size 4 --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3 \
    --mm-processor-cache-gb 0 --mm-encoder-tp-mode data \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```

What works (identical except for the reasoning parser):

```bash
pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server \
    --port 5000 --host 0.0.0.0 \
    --download-dir /workspace/.cache/huggingface/hub \
    --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
    --tensor-parallel-size 4 --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1 \
    --mm-processor-cache-gb 0 --mm-encoder-tp-mode data \
    --media-io-kwargs '{"video": {"num_frames": -1}}'
```
Force-pushed from 540716d to fe3814f.
Fixes #28297
Purpose
Test Plan
```bash
pytest tests/reasoning/test_qwen3_reasoning_parser.py -v
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt.video 0 \
    --limit-mm-per-prompt.image 0 \
    --async-scheduling \
    --enable-log-outputs \
    --enable-log-requests
```
test_tool_call_streaming.py:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products_general",
            "description": "Search for products",
            "parameters": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "config": {"type": "object"},
                },
            },
        },
    }
]

messages = [
    {"role": "user", "content": "pc hp 16gb ram"},
]

# Streaming: tool call deltas should arrive via `delta.tool_calls`,
# not as raw `<tool_call>` text inside `delta.content`.
print("=== Testing with streaming ===")
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(f"Chunk: {chunk.choices[0].delta.content}")
    if chunk.choices[0].delta.tool_calls:
        print(f"Tool calls chunk: {chunk.choices[0].delta.tool_calls}")

# Non-streaming: the same request should return parsed tool_calls.
print("\n=== Testing without streaming ===")
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=False,
)
print(f"Response: {response.choices[0].message}")
if response.choices[0].message.tool_calls:
    print(f"Tool calls: {response.choices[0].message.tool_calls}")
```
Test Result