Fix premature conversation termination when LLM produces content (GPT-5 Codex and GLM 4.6) #1304
Conversation
Remove early return that was causing conversations to terminate prematurely when the LLM produced content, even when tool calls were still being processed. This ensures the conversation continues properly through the full execution flow. Co-authored-by: openhands <openhands@all-hands.dev>
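To illustrate the bug described above, here is a minimal, hypothetical sketch of the removed pattern; `Response`, `step`, and the state strings are illustrative names, not the SDK's actual identifiers:

```python
# Hypothetical sketch of the premature-termination bug this PR removes.
from dataclasses import dataclass, field

@dataclass
class Response:
    content: str = ""
    tool_calls: list = field(default_factory=list)

def step(response: Response) -> str:
    # BUGGY (removed by this PR): finishing as soon as any text content
    # appeared, even though tool calls were still pending:
    #
    #   if response.content:
    #       return "FINISHED"   # premature termination
    #
    # Fixed flow: only finish when there is nothing left to execute.
    if response.tool_calls:
        return "RUNNING"        # execute the pending tool calls first
    return "FINISHED"

print(step(Response(content="Let me edit the file.",
                    tool_calls=["str_replace_editor"])))  # RUNNING
print(step(Response(content="All done.")))                # FINISHED
```

Models like GPT-5 Codex and GLM 4.6 often emit explanatory text alongside their tool calls, which is why the `has_content` check tripped for them in particular.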
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🧪 Integration Tests Results
Overall Success Rate: 89.5%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
litellm_proxy_gpt_5_mini_2025_08_07
Failed Tests:
@juanmichelini It might be worth looking into these logs; I'm not sure what happens here with Sonnet getting 5/8. 5/8 is really surprising. Or gpt-5-mini:
Looks like it just talks to the user, so a … Maybe we need the …
Hey Engel! I've uploaded yesterday's logs here: https://drive.google.com/drive/folders/1KMAq14ztG8-ug6aLVWDoGR6zp6ifVlHF I'm doing small runs (10~20 issues), but the number of empty patches I get is pretty consistent with and without the fix.
Hey @juanmichelini, within the logs, are there any particular traces that are indicative of the problem in the original code? I'd like to take a closer look to better understand the problem.
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🧪 Integration Tests Results
Overall Success Rate: 86.8%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
litellm_proxy_gpt_5_mini_2025_08_07
Failed Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
Failed Tests:
litellm_proxy_vertex_ai_gemini_3_pro_preview
litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:
In some conversations uploaded by Juan, I see: the conversation ended with a … Related, but maybe unnecessary: in the codex-cli, they have this text for GPT-5:

However, this text doesn't exist in the system prompts for …
[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like … Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings.
Thanks @enyst. @xingyaoww: based on your previous experience, I wonder what you think we should do here. Should we re-implement the fake user message, modify the swe-bench prompt, or something else?
Related:
We could try this PR as it is, if you'd like, but I think that if we send back the last message with the 'assistant' role, we get a 400, so we need a 'user'-role message, which is what the 'fake user message' was doing.
How about we solve issue #1351 that Engel mentioned, force the agent to emit FinishAction when it is actually done, and send back the "fake user message" when it sends …
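A minimal sketch of the proposal above, under the assumption that the fix lives in the step loop; `next_messages`, `FAKE_USER_MESSAGE`, and the dict shapes are hypothetical names, not the SDK's real API:

```python
# Illustrative sketch: when the LLM replies with plain text and neither
# tool calls nor a FinishAction, send back a synthetic user message
# asking it to continue, instead of terminating the run.
FAKE_USER_MESSAGE = (
    "Please continue working on the task. "
    "If you are done, call the finish tool."
)

def next_messages(messages: list, response: dict) -> tuple[list, bool]:
    """Return the updated history and whether we injected a nudge."""
    messages = messages + [{"role": "assistant",
                            "content": response.get("content", "")}]
    if response.get("tool_calls") or response.get("finish"):
        return messages, False  # normal flow: execute tools or stop
    # No tool call and no explicit finish: nudge the model to continue.
    messages.append({"role": "user", "content": FAKE_USER_MESSAGE})
    return messages, True

msgs, nudged = next_messages([], {"content": "I think the patch is ready."})
print(nudged)            # True
print(msgs[-1]["role"])  # user
```

This also sidesteps the 400 mentioned earlier, since the history sent back to the API ends with a user-role message.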
This issue happens only when benchmarking or testing the models GPT-5 Codex and GLM 4.6. Those two models behave differently from the Claude family, and the current fix fails the tests for Claude Sonnet 4. (Side note, something else that might be related: unlike GPT-5 Codex, GPT-5 gives patches in most cases, but only in 20% of the cases does it show a FinishAction, which makes GPT-5 more costly to run when doing multiple iterations. See the benchmark sheet and compare both GPT-5 entries.) Changing the conditions for FinishAction might impact all model evaluations, so I do not think we should merge the fix as is, but we could:
I ran this PR change with a …
@juanmichelini Thank you for the details; now I understand better why you were linking fake_user_message to a lot of evals. But... I think we need to fix this; I just don't see an alternative. I think this is a bug, so it's not optional 😅 I mean, IMHO, this PR is still incomplete for running evals, because:
@juanmichelini Please allow me to note: I looked into this and had the agent reproduce it, and I think you're correct that this happens only because we're missing a user message telling the LLM to continue. The answer to the question is LLM-dependent:
IMHO, we simply need to port … into … I don't think we can avoid it; otherwise the runs stop earlier than normal, which likely means worse results than they could potentially have... 🤔 For any LLM. (On a side note, I also think we will need to account for the behavior change in this PR in the CLI headless mode too.) Please note: in this case, liteLLM will send "continue" from the user to the LLM API, in certain circumstances. Please see details here: … In the past, we had …
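The role constraint behind this can be sketched in a few lines; `prepare_continuation` is a hypothetical helper, not SDK or liteLLM code. The point is that many chat APIs reject a request whose last message is from the assistant (the 400 mentioned earlier), so re-prompting requires appending a user-role message first:

```python
# Minimal sketch of the role-alternation constraint discussed above.
def prepare_continuation(messages: list, continue_text: str = "continue") -> list:
    """Ensure the history sent to the LLM API ends with a user message."""
    if messages and messages[-1]["role"] == "assistant":
        # Equivalent in spirit to the old `fake_user_message`; liteLLM
        # reportedly injects a similar "continue" in some circumstances.
        return messages + [{"role": "user", "content": continue_text}]
    return messages

history = [
    {"role": "user", "content": "Fix the bug in parser.py"},
    {"role": "assistant", "content": "I believe the fix is complete."},
]
print(prepare_continuation(history)[-1])
# {'role': 'user', 'content': 'continue'}
```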
How about,
Without the tool, the LLM has to respond to the user in plain text, which allows us to see the behavior of the SDK in that case.
Hey @enyst!
I think you are right. This fix improves how many patches we get with Codex from ~20% to ~80%, but that is still far from the ~98% or so which is the usual number.
By "port", do you suggest we close this PR without merging, and only make a similar PR on the benchmarks repo?
enyst left a comment:
Just for clarity: IMHO this is important to fix, and I think we need to do it somehow coordinated with a benchmarks-repo fix. This seems to be the case regardless of what the fix is, probably.
IMHO Xingyao's suggestion in this thread is the right fix, but even if the fix is different, I'd love to have some real-world LLM test (maybe an integration test, or maybe changes to examples/) ... it is a bug.
OK, I'll create a separate issue for the benchmark repo. Regarding this PR, we can add tests just for this case, but I'm worried about the tests that are currently failing. I have not found a fix that both fixes this and passes the tests.
@OpenHands This CI job is failing: https://github.com/OpenHands/software-agent-sdk/actions/runs/19969962009/job/57271889637?pr=1304 Please look into it, but don't fix it yet: instead, tell me, for each of those 11 tests, what the issue was. Also, analyze and tell me whether each of those 11 tests would still be failing, and why, if instead of setting any conversation without tool calls to FINISHED, we set it to IDLE.
I'm on it! enyst can track my progress at all-hands.dev
Summary of findings for the failing CI job on PR #1304

What I did
Root cause overview
Per-test diagnosis and “IDLE instead of FINISHED” analysis
Bottom line
OpenHands-GPT-5: Recommendation: set IDLE and stop the run loop
Why IDLE fits best
What needs changing
Trade-offs and edge cases
Summary
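The IDLE recommendation above can be sketched as follows; the state names and `state_after` helper are hypothetical, chosen to mirror the discussion rather than the SDK's actual `ConversationState`:

```python
# Sketch: a text-only response (no tool calls, no explicit finish)
# leaves the conversation IDLE (resumable) rather than FINISHED
# (terminal), so callers and tests can send another message to resume.
from enum import Enum, auto

class ConversationState(Enum):
    RUNNING = auto()
    IDLE = auto()      # run loop stopped, but the conversation can resume
    FINISHED = auto()  # terminal: the agent explicitly finished

def state_after(response: dict) -> ConversationState:
    if response.get("tool_calls"):
        return ConversationState.RUNNING
    if response.get("finish_action"):
        return ConversationState.FINISHED
    # Plain text with no tool call: stop the loop but stay resumable.
    return ConversationState.IDLE

print(state_after({"content": "What should I do next?"}))  # ConversationState.IDLE
```

Under this scheme, benchmark harnesses that need fully autonomous runs would respond to IDLE with a nudge (the "fake user message"), while interactive callers would simply hand control back to the user.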
Remove early return that was causing conversations to terminate prematurely when the LLM produced content, even when tool calls were still being processed. This ensures the conversation continues properly through the full execution flow.
This change removes the problematic code block that was checking has_content and immediately finishing the conversation, which prevented proper processing of tool calls and other agent actions.

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images:
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest):
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6eba439-python

Run
All tags pushed for this build
About Multi-Architecture Support
- 6eba439-python is a multi-arch manifest supporting both amd64 and arm64
- Arch-specific tags (e.g. 6eba439-python-amd64) are also available if needed