
Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Nov 7, 2025

Purpose

Fix #28297 (hermes tool parser fails in streaming mode when used with the qwen3 reasoning parser).

Test Plan

pytest tests/reasoning/test_qwen3_reasoning_parser.py -v

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-32B-Instruct \
    --port 8000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt.video 0 \
    --limit-mm-per-prompt.image 0 \
    --async-scheduling \
    --enable-log-outputs \
    --enable-log-requests
(vllm) root@de44d613f1e7:~/vllm# cat test_tool_call_streaming.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products_general",
            "description": "Search for products",
            "parameters": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "config": {"type": "object"}
                }
            }
        }
    }
]

messages = [
    {"role": "user", "content": "pc hp 16gb ram"}
]

print("=== Testing with streaming ===")
stream = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(f"Chunk: {chunk.choices[0].delta.content}")
    if chunk.choices[0].delta.tool_calls:
        print(f"Tool calls chunk: {chunk.choices[0].delta.tool_calls}")

print("\n=== Testing without streaming ===")
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=messages,
    tools=tools,
    stream=False
)

print(f"Response: {response.choices[0].message}")
if response.choices[0].message.tool_calls:
    print(f"Tool calls: {response.choices[0].message.tool_calls}")

Test Result

[screenshot: test results]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the frontend label Nov 7, 2025
@baonudesifeizhai baonudesifeizhai force-pushed the fix/tool-call-streaming-with-reasoning-parser branch from dde4fcb to e059702 Compare November 7, 2025 23:37
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for tool call streaming when both reasoning and tool parsers are active. The change adds logic to detect the start of a tool call and skip the reasoning phase. While the intention is correct, the implementation has a flaw in how it manages state during this transition. This could lead to the tool parser not receiving the complete text, which can cause it to miss the start of the tool call. I've added a critical review comment with a suggested code change to address this by simplifying the logic and leveraging the existing state management code.

@chatgpt-codex-connector

💡 Codex Review

# Check if tool call tokens are present in the output
# If so, skip reasoning and go directly to tool parsing
# This handles cases where models output tool calls without
# reasoning content (e.g., Qwen3-VL with hermes tool parser)
if (
    not reasoning_end_arr[i]
    and tool_parser
    and hasattr(tool_parser, "tool_call_start_token")
    and tool_parser.tool_call_start_token in current_text
):
    reasoning_end_arr[i] = True
    # Prepare for tool parsing by resetting state
    if not added_content_delta_arr[i]:
        added_content_delta_arr[i] = True
        previous_text = ""
        previous_token_ids = []
if not reasoning_end_arr[i]:
    delta_message = (
        reasoning_parser.extract_reasoning_content_streaming(
            previous_text,
            current_text,
            delta_text,
            previous_token_ids,
            current_token_ids,
            output_token_ids,
        )
    )
    # When encountering think end id in prompt_token_ids
    # i.e {"enable_thinking": False},
    # set reasoning status to end.
    # Remove the text and token ids related
    # to 'reasoning_content'.
    if (
        res.prompt_token_ids
        and reasoning_parser.is_reasoning_end(
            res.prompt_token_ids
        )
    ):
        reasoning_end_arr[i] = True
        current_token_ids = output_token_ids
        if delta_message and delta_message.content:
            current_text = delta_message.content
            delta_message.content = None
        else:
            current_text = ""
    # When encountering think end id in delta_token_ids,
    # set reasoning status to end.
    # Remove the text and token ids related
    # to 'reasoning_content'.
    if reasoning_parser.is_reasoning_end(output_token_ids):
        reasoning_end_arr[i] = True
        current_token_ids = (
            reasoning_parser.extract_content_ids(
                output_token_ids
            )
        )
        if delta_message and delta_message.content:
            current_text = delta_message.content
            delta_message.content = None
        else:
            current_text = ""
# handle tool calls only after reasoning is done,
else:
    delta_token_ids = output_token_ids
    # First time to tool call,
    # add the remaining text and token ids
    # to delta from previous
    if not added_content_delta_arr[i]:
        added_content_delta_arr[i] = True
        previous_text = ""
        previous_token_ids = []
        delta_text = current_text
        delta_token_ids = current_token_ids

P1: Replay accumulated tool tokens when skipping reasoning

The new early‑exit detects tool_call_start_token and marks reasoning_end_arr[i] = True, but it sets added_content_delta_arr[i] before the branch that replays buffered text (if not added_content_delta_arr[i]: … delta_text = current_text; delta_token_ids = current_token_ids). As a result, when <tool_call> spans multiple streaming chunks, the tokens emitted before the final detection remain only in current_text and are never copied into delta_text/delta_token_ids for the first tool parser invocation. Tool parsers that parse solely from the delta (e.g. qwen3xml_tool_parser.parse_single_streaming_chunks) will now start parsing in the middle of the tool call and fail to detect it. The reasoning skip should still forward the accumulated text and token ids before handing control to the tool parser.
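For reference, a minimal sketch of the adjustment this finding describes, reusing the variable names from the snippet above (illustrative only, not the final patch): detect the start token and mark reasoning as ended, but leave the replay flag untouched so the existing "first time to tool call" branch forwards the buffered text and token ids.

# Sketch: skip reasoning once the tool-call start token appears,
# but do NOT set added_content_delta_arr[i] here.
if (
    not reasoning_end_arr[i]
    and tool_parser
    and hasattr(tool_parser, "tool_call_start_token")
    and tool_parser.tool_call_start_token in current_text
):
    reasoning_end_arr[i] = True
    # The tool-call branch below then runs its first-time replay:
    #   delta_text = current_text
    #   delta_token_ids = current_token_ids
    # so the tool parser sees the full "<tool_call>" prefix instead
    # of starting to parse mid-call.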


@cjackal
Contributor

cjackal commented Nov 8, 2025

Does this response-without-reasoning only happen for the tool-choice auto case?

@baonudesifeizhai
Contributor Author

Does this response-without-reasoning only happen for the tool-choice auto case?

Yes.

Contributor

@llsj14 llsj14 left a comment


I think this is a nice fix!
I'm wondering if I understood it correctly. My understanding is:

  • If the model produces reasoning content, it would be generated and should be handled before any tool call.
  • Tool calls can appear without reasoning. Especially in Qwen3-VL with the Hermes tool parser, there are cases where the model outputs a tool call directly, even with the reasoning parser activated.

How about integrating your addition into the existing conditions like the following?

if (
    not reasoning_end_arr[i]
    and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )
):
   ...

else:
    reasoning_end_arr[i] = True

…bled

- Add early detection of tool call tokens in streaming mode
- Skip reasoning phase when tool calls are present
- Fixes issue vllm-project#28297 where hermes tool parser fails in streaming mode
  when used together with qwen3 reasoning parser

Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
@baonudesifeizhai baonudesifeizhai force-pushed the fix/tool-call-streaming-with-reasoning-parser branch 2 times, most recently from 04a5f02 to 2df2ff4 Compare November 8, 2025 21:46
@baonudesifeizhai
Contributor Author

baonudesifeizhai commented Nov 8, 2025

Already changed! But

if (
    not reasoning_end_arr[i]
    and (
        not tool_parser
        or not hasattr(tool_parser, "tool_call_start_token")
        or tool_parser.tool_call_start_token not in current_text
    )
):

was reformatted by ruff to

if not reasoning_end_arr[i] and (
    not tool_parser
    or not hasattr(tool_parser, "tool_call_start_token")
    or tool_parser.tool_call_start_token not in current_text
):


@ehfd

ehfd commented Nov 9, 2025

Would really like this merged before v0.11.1. Tool calling in Qwen is one of the things that has lagged behind SGLang recently.

Currently, a workaround is to use the deepseek_r1 reasoning parser while retaining the hermes tool parser.

@chaunceyjiang chaunceyjiang self-assigned this Nov 10, 2025
@chaunceyjiang
Collaborator

@baonudesifeizhai Qwen3-VL-32B does not support reasoning, so there's no need to set --reasoning-parser qwen3.

Have you tested whether the results are correct without using the --reasoning-parser argument?

@baonudesifeizhai
Contributor Author

Without the --reasoning-parser argument it seems to work well....


@chaunceyjiang
Collaborator

chaunceyjiang commented Nov 10, 2025

Without the --reasoning-parser argument it seems to work well....

Yes. Qwen3-VL-32B-Instruct does not support reasoning.

@chaunceyjiang
Collaborator

Qwen3-VL-32B-Thinking supports reasoning.

@ehfd

ehfd commented Nov 15, 2025

@baonudesifeizhai It seems that the sign-off is missing in one of your commits: https://github.com/vllm-project/vllm/pull/28330/checks?check_run_id=54885615468

@ehfd

ehfd commented Nov 15, 2025

@chaunceyjiang Otherwise, can we get this merged before v0.11.1? Using this for Qwen3-VL-235B-A22B-Thinking.

@chaunceyjiang
Collaborator

Can you paste the result without using the --async-scheduling parameter?

@baonudesifeizhai
Contributor Author

Without --async-scheduling it works fine:
(.venv) root@df308fb7b11d:~/vllm# python test_tool_call_streaming.py
=== Testing with streaming ===
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id='chatcmpl-tool-a5fb728d638b4b78815d9efa6cc069e0', function=ChoiceDeltaToolCallFunction(arguments=None, name='search_products_general'), type='function')]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='{"description": "', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='pc', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' hp', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' ', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='1', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='6', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='gb', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' ram', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='",', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' "', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='config', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='":', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' {"', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='num', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='_results', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='":', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments=' ', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='5', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='}}', name=None), type=None)]
Tool calls chunk: [ChoiceDeltaToolCall(index=0, id=None, function=ChoiceDeltaToolCallFunction(arguments='', name=None), type=None)]

=== Testing without streaming ===
Response: ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-4fc128eb676c4ebabbab16e140c96206', function=Function(arguments='{"description": "pc hp 16gb ram", "config": {"num_results": 5}}', name='search_products_general'), type='function')], reasoning=None, reasoning_content=None)
Tool calls: [ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-4fc128eb676c4ebabbab16e140c96206', function=Function(arguments='{"description": "pc hp 16gb ram", "config": {"num_results": 5}}', name='search_products_general'), type='function')]
(.venv) root@df308fb7b11d:~/vllm#


@chaunceyjiang
Collaborator

Hi @baonudesifeizhai, what I meant is: without using --async-scheduling, does the issue described in your PR #28297 still exist? Also, it seems that #28297 has already been closed.

I tested tool_call_streaming locally with the latest main branch, and it works correctly. As I mentioned earlier, Qwen/Qwen3-VL-32B-Instruct does not support reasoning, so there's no need to specify the parameter "--reasoning-parser", "qwen3".

Add early detection of tool call tokens in streaming mode
Skip reasoning phase when tool calls are present
Fixes issue #28297 where hermes tool parser fails in streaming mode when used together with qwen3 reasoning parser

So I'm now quite confused: which issue does your PR actually resolve?


@chaunceyjiang
Collaborator

@chaunceyjiang Otherwise, can we get this merged before v0.11.1? Using this for Qwen3-VL-235B-A22B-Thinking.

@ehfd

Could you describe your issue in detail? And could you also provide the vLLM command you used?

@baonudesifeizhai
Contributor Author

baonudesifeizhai commented Nov 17, 2025

Not solved anything right now... so if he doesn't have any other problem I will close this PR....


@ehfd

ehfd commented Nov 17, 2025

@chaunceyjiang The issue is that https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 and https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 had substantial problems in this precise situation with tool calling in clients such as Cline/Roo Code.

When reasoning is enabled together with tool calling, the tool calls are exposed bare, with all the thinking content exposed along with the <think></think> tags.

The solution was to use the deepseek_r1 reasoning parser instead of qwen3, but there is potential for this to work properly in qwen3 as well.
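To illustrate (a reconstructed example, not captured output): instead of the thinking being routed into reasoning_content deltas and the call into tool_calls deltas, the client receives the raw markup as plain content, roughly:

<think>The user wants a product search, so I should call search_products_general.</think>
<tool_call>
{"name": "search_products_general", "arguments": {"description": "pc hp 16gb ram"}}
</tool_call>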

@ehfd

ehfd commented Nov 17, 2025

What doesn't work:

pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server --port 5000 --host 0.0.0.0 --download-dir /workspace/.cache/huggingface/hub --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 --tensor-parallel-size 4 --trust-remote-code --enable-chunked-prefill --enable-prefix-caching --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3 --mm-processor-cache-gb 0 --mm-encoder-tp-mode data --media-io-kwargs '{"video": {"num_frames": -1}}'

What works:

pip install 'qwen-vl-utils[decord]==0.0.14' && python3 -m vllm.entrypoints.openai.api_server --port 5000 --host 0.0.0.0 --download-dir /workspace/.cache/huggingface/hub --model Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 --tensor-parallel-size 4 --trust-remote-code --enable-chunked-prefill --enable-prefix-caching --max-num-seqs 128 --gpu-memory-utilization 0.98 --max-model-len 262144 --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1 --mm-processor-cache-gb 0 --mm-encoder-tp-mode data --media-io-kwargs '{"video": {"num_frames": -1}}'

