
Conversation

@bbrowning (Contributor) commented Nov 7, 2025

Purpose

The output generated by gpt-oss models does not always strictly follow the expected harmony chat template format. This commonly, but not exclusively, happens when gpt-oss-120b generates refusals for content that violates its built-in safety guidelines.

To fix this, a non-strict mode was added to the openai-harmony library to allow attempted recovery of malformed message headers in the model output, such as a missing `<|message|>` special token before the assistant text.

This will resolve some cases where the error `openai_harmony.HarmonyError: unexpected tokens remaining in message header` was previously thrown. It will not resolve all of them, as not every malformed message output can be recovered. Separate ongoing work on using structured output for the Harmony format can help prevent this kind of malformed output in the first place, once that work lands and in the cases where the user and/or server decide to enable it.
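
To make the behavior concrete, here is a minimal sketch (not the actual vLLM or openai-harmony code); the `strict` keyword argument is an assumed name for however the library exposes the new mode, so check the openai-harmony release notes for the real API:

```python
# Minimal sketch, assuming the non-strict behavior is exposed via a
# hypothetical `strict` keyword on parse_messages_from_completion_tokens.
from openai_harmony import HarmonyEncodingName, Role, load_harmony_encoding

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

def parse_assistant_output(token_ids: list[int]):
    # Strict parsing raises HarmonyError on a header that is missing the
    # <|message|> token before the assistant text; with the assumed
    # non-strict flag the parser attempts to recover the message instead,
    # and only raises if recovery is impossible.
    return encoding.parse_messages_from_completion_tokens(
        token_ids, Role.ASSISTANT, strict=False  # assumed keyword
    )
```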

I believe it should be safe to enable this non-strict mode by default in vLLM, as the code path it enables in the openai-harmony library only gets triggered once malformed output has already been detected. So, there shouldn't be any performance penalty in the common case. And, in the event that the malformed content cannot be properly recovered, the openai-harmony library will still end up throwing an error.

This is related to #23567 as well as openai/harmony#80.

Test Plan

I added a new test to verify the refusal parsing in test_harmony_utils.py. I also ran test_response_api_with_harmony.py locally, but skipped the code interpreter test because my dev machine is not set up to run it.

pytest tests/entrypoints/test_harmony_utils.py
pytest -k "not code_interpreter" tests/entrypoints/openai/test_response_api_with_harmony.py

Test Result

$ pytest -q --disable-warnings tests/entrypoints/test_harmony_utils.py
................                                                                  [100%]
16 passed, 2 warnings in 2.38s

$ pytest -q --disable-warnings -k "not code_interpreter" \
  tests/entrypoints/openai/test_response_api_with_harmony.py
................s........                                                         [100%]
24 passed, 1 skipped, 1 deselected, 3 warnings in 74.30s (0:01:14)

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a parsing issue with gpt-oss model outputs by upgrading the openai-harmony library and enabling its non-strict parsing mode. The change is well-justified, with the core logic modification being minimal and correctly targeted. The addition of a new test case, test_malformed_refusal_message, is excellent as it specifically validates the fix for the described malformed refusal messages. The dependency updates in requirements/common.txt and requirements/test.txt are consistent with the required library version. Overall, this is a solid bugfix that improves the robustness of handling gpt-oss outputs.

@njhill (Member) commented Nov 8, 2025

Thanks @bbrowning!

Might this help with some of the potentially flaky tests such as https://buildkite.com/vllm/ci/builds/38051#019a6063-c042-4667-bc3f-859390c7272d?

[2025-11-07T23:44:09Z] (APIServer pid=10583)   File "/usr/local/lib/python3.12/dist-packages/openai_harmony/__init__.py", line 627, in process
[2025-11-07T23:44:09Z] (APIServer pid=10583)     self._inner.process(token)
[2025-11-07T23:44:09Z] (APIServer pid=10583) openai_harmony.HarmonyError: Unknown role: assistantary

@bbrowning (Contributor, author)

@njhill I don't think this will do anything in that particular case; on the surface it looks like a role of `assistantary` is being generated instead of `assistant`, but how or why that's happening I'm unsure. If there's not already a bug to track that down and reproduce/fix it, we should probably make one.

@alecsolder (Contributor)

Some random thoughts:

I think I lean toward not having this enabled by default at first, or, if we do enable it by default, we need to:

  • Have validation on our side that the output message does in fact have enough metadata to be "complete"
  • Ensure that things which shouldn't be in the content aren't in the content (i.e., you shouldn't see `<|channel|>` in the content because of this)
  • Have logging so we know how often we are actually hitting the case where strict=False is preventing a crash

For example, if I deployed this, I'd have no way of understanding the impact of the change, such as how many requests it's "saving" from erroring; a rough sketch of that kind of counter follows below.
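
A very rough sketch of what that logging could look like on the vLLM side, assuming a wrapper like the one below were wired in wherever vLLM parses harmony output (the helper name, the counter, and the `strict` keyword are all assumptions, not existing code):

```python
import logging

from openai_harmony import HarmonyError

logger = logging.getLogger("vllm.harmony")
recovered_outputs = 0  # hypothetical process-wide counter


def parse_with_fallback(encoding, token_ids, role):
    """Try strict parsing first; count and log whenever the non-strict
    fallback is what keeps a request from erroring."""
    global recovered_outputs
    try:
        return encoding.parse_messages_from_completion_tokens(token_ids, role)
    except HarmonyError:
        messages = encoding.parse_messages_from_completion_tokens(
            token_ids, role, strict=False  # assumed keyword, as above
        )
        recovered_outputs += 1
        logger.warning(
            "Recovered malformed harmony output (total recovered: %d)",
            recovered_outputs,
        )
        return messages
```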

I also think there is a difference between the impact of this change on the single-turn Responses API and on the tool-calling while loop. In single-turn Responses API usage there is a better ability to "repair" harmony messages when they are translated to response items and back into harmony messages, but in the tool-calling loop the conversation stays as harmony messages the entire time, so these issues may compound silently.
