
Conversation

@juanmichelini (Collaborator) commented Dec 3, 2025

Remove early return that was causing conversations to terminate prematurely when the LLM produced content, even when tool calls were still being processed. This ensures the conversation continues properly through the full execution flow.

This change removes the problematic code block that checked `has_content` and immediately finished the conversation, which prevented tool calls and other agent actions from being processed.
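
For context, here is a minimal, self-contained sketch of the behavior change described above. The names (LLMMessage, ExecutionStatus, step) are assumptions chosen for illustration, not the actual SDK code; the removed early return is shown commented out.

from dataclasses import dataclass, field
from enum import Enum, auto

class ExecutionStatus(Enum):
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class LLMMessage:
    content: str = ""
    tool_calls: list = field(default_factory=list)

@dataclass
class State:
    execution_status: ExecutionStatus = ExecutionStatus.RUNNING
    events: list = field(default_factory=list)

def step(state: State, message: LLMMessage) -> None:
    # Removed early return (pre-fix): a content check finished the conversation
    # immediately, even when tool calls were still pending.
    #
    # if message.content:
    #     state.execution_status = ExecutionStatus.FINISHED
    #     return
    #
    # Post-fix: record the message and keep processing tool calls normally.
    if message.content:
        state.events.append(("message", message.content))
    for call in message.tool_calls:
        state.events.append(("tool_call", call))

state = State()
step(state, LLMMessage(content="Working on it...", tool_calls=["run_tests"]))
print(state.events)  # both the message and the tool call are processed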


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6eba439-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6eba439-python \
  ghcr.io/openhands/agent-server:6eba439-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6eba439-golang-amd64
ghcr.io/openhands/agent-server:6eba439-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6eba439-golang-arm64
ghcr.io/openhands/agent-server:6eba439-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6eba439-java-amd64
ghcr.io/openhands/agent-server:6eba439-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6eba439-java-arm64
ghcr.io/openhands/agent-server:6eba439-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6eba439-python-amd64
ghcr.io/openhands/agent-server:6eba439-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:6eba439-python-arm64
ghcr.io/openhands/agent-server:6eba439-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:6eba439-golang
ghcr.io/openhands/agent-server:6eba439-java
ghcr.io/openhands/agent-server:6eba439-python

About Multi-Architecture Support

  • Each variant tag (e.g., 6eba439-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 6eba439-python-amd64) are also available if needed

@enyst added the integration-test label (Runs the integration tests and comments the results) Dec 3, 2025
github-actions bot (Contributor) commented Dec 3, 2025

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Dec 3, 2025

🧪 Integration Tests Results

Overall Success Rate: 89.5%
Total Cost: $1.34
Models Tested: 5
Timestamp: 2025-12-03 07:43:16 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Success Rate Tests Passed Skipped Total Tests Cost
litellm_proxy_claude_sonnet_4_5_20250929 62.5% 5/8 0 8 $0.44
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.49
litellm_proxy_vertex_ai_gemini_3_pro_preview 100.0% 8/8 0 8 $0.31
litellm_proxy_deepseek_deepseek_chat 100.0% 7/7 1 8 $0.05
litellm_proxy_gpt_5_mini_2025_08_07 87.5% 7/8 0 8 $0.04

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 62.5% (5/8)
  • Total Cost: $0.44
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_0c6f36a_sonnet_run_N8_20251203_073610

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: ... (Cost: $0.04)
  • t06_github_pr_browsing: No final answer found from agent. Events: 9, LLM messages: 4 (Cost: $0.10)
  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: (Cost: $0.04)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.49
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_0c6f36a_kimi_k2_run_N8_20251203_073615
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.31
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_0c6f36a_gemini_3_pro_run_N8_20251203_073615

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_0c6f36a_deepseek_run_N8_20251203_073615
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_0c6f36a_gpt5_mini_run_N8_20251203_073615

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000 and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?... (Cost: $0.0033)

@enyst (Collaborator) commented Dec 3, 2025

@juanmichelini It might be worth looking into these logs, not sure what happens here with Sonnet getting 5/8. 5/8 is really surprising

Or gpt-5-mini:

t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000/ and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?...

Looks like it just talks to the user, so a content response.

Maybe we need the fake user message...?

@juanmichelini (Collaborator, Author) commented:

@juanmichelini It might be worth looking into these logs, not sure what happens here with Sonnet getting 5/8. 5/8 is really surprising

Or gpt-5-mini:

t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000/ and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?...

Looks like it just talks to the user, so a content response.

Maybe we need the fake user message...?

Hey Engel! I've uploaded yesterday's logs here: https://drive.google.com/drive/folders/1KMAq14ztG8-ug6aLVWDoGR6zp6ifVlHF I'm doing small runs (10~20 issues), but the number of empty patches I get is pretty consistent with and without the fix.

@juanmichelini changed the title from "Fix premature conversation termination when LLM produces content" to "Fix premature conversation termination when LLM produces content (GPT-5 Codex and GLM 4.6)" Dec 5, 2025
@neubig (Contributor) commented Dec 5, 2025

Hey @juanmichelini , within the logs, are there any particular traces that are indicative of the problem in the original code? I'd like to take a closer look to better understand the problem.

@enyst added and removed the integration-test label (Runs the integration tests and comments the results) Dec 5, 2025
github-actions bot (Contributor) commented Dec 5, 2025

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Dec 5, 2025

🧪 Integration Tests Results

Overall Success Rate: 86.8%
Total Cost: $1.11
Models Tested: 5
Timestamp: 2025-12-05 17:01:16 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Success Rate Tests Passed Skipped Total Tests Cost
litellm_proxy_deepseek_deepseek_chat 100.0% 7/7 1 8 $0.05
litellm_proxy_gpt_5_mini_2025_08_07 87.5% 7/8 0 8 $0.07
litellm_proxy_moonshot_kimi_k2_thinking 85.7% 6/7 1 8 $0.10
litellm_proxy_vertex_ai_gemini_3_pro_preview 100.0% 8/8 0 8 $0.39
litellm_proxy_claude_sonnet_4_5_20250929 62.5% 5/8 0 8 $0.49

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_b6e3767_deepseek_run_N8_20251205_165711
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.07
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_b6e3767_gpt5_mini_run_N8_20251205_165713

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000 and extracted the displayed answer. Next steps: ask the user if they want me to simulate clicking the button in a headless browser, fetch other endpoints, or make any mod... (Cost: $0.0047)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 85.7% (6/7)
  • Total Cost: $0.10
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_b6e3767_kimi_k2_run_N8_20251205_165711
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.02)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.39
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_b6e3767_gemini_3_pro_run_N8_20251205_165714

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 62.5% (5/8)
  • Total Cost: $0.49
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_b6e3767_sonnet_run_N8_20251205_165716

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: ... (Cost: $0.04)
  • t06_github_pr_browsing: No final answer found from agent. Events: 9, LLM messages: 4 (Cost: $0.09)
  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: (Cost: $0.04)

@enyst (Collaborator) commented Dec 5, 2025

In some conversations uploaded by Juan, I see:
gpt-5-codex MessageEvent

The conversation ended with a MessageEvent. This MessageEvent has content, so the LLM is speaking to the user and waiting for a reply. That's the kind of case, it seems to me, for which we had created and sent a fake user message in V0, to tell it to continue working.


Related, but maybe unnecessary, in the codex-cli, they have this text for GPT-5:

You are a coding agent. Please keep going until the query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. Autonomously resolve the query to the best of your ability, using the tools available to you, before coming back to the user. Do NOT guess or make up an answer.

However, this text doesn't exist in the system prompts for the codex variants. I assume that means it might not be needed for gpt-5-codex (although... it seems to be what we see), or maybe something else has taken its place in their SWE-bench instruction, as opposed to the system message.
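
As a concrete illustration of that V0 behavior, here is a hedged sketch of sending a fake user message when the agent stops by talking instead of acting. The event attributes and the send_message call are assumptions for illustration, not the real SDK interface.

CONTINUE_NUDGE = (
    "Please continue working on the task autonomously, using the tools "
    "available to you, and finish only when the task is completely resolved."
)

def maybe_nudge(conversation) -> bool:
    """Send the nudge if the agent's last event was a content-only message."""
    last = conversation.events[-1]
    if getattr(last, "tool_calls", None):
        return False  # the agent is still acting; nothing to do
    # A user-role message keeps the model working; echoing the assistant's own
    # text back would not help (and could be rejected by the API).
    conversation.send_message(CONTINUE_NUDGE)
    return True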

blacksmith-sh bot requested a review from raymyers, December 6, 2025 12:57
blacksmith-sh bot (Contributor) commented Dec 6, 2025

[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!

@neubig requested review from neubig and removed the request for raymyers, December 6, 2025 16:12
openhands-ai bot commented Dec 6, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1304 at branch `fix-premature-conversation-termination-clean`

Feel free to include any additional details that might help me get this PR into a better state.


@neubig (Contributor) commented Dec 6, 2025

Thanks @enyst. @xingyaoww: based on your previous experience, what do you think we should do here? Should we re-implement the fake user message, modify the SWE-bench prompt, or something else?

@enyst (Collaborator) commented Dec 8, 2025

Related:
I think Simon is seeing the same thing I mentioned above: a MessageEvent where we mark the conversation FINISHED.

We could try this PR as it is, if you'd like, but I think if we send back the last message with an 'assistant' role, we get a 400, so we need a 'user'-role message, which is what the 'fake user message' was doing.

@xingyaoww (Collaborator) commented:

How about we solve issue #1351 that Engel mentioned: force the agent to emit a FinishAction when it is actually done, and send back a "fake user message" when it sends a MessageEvent?
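
A hedged sketch of what that combination could look like; the event classes and status handling here are illustrative assumptions, not the actual SDK types:

from dataclasses import dataclass

@dataclass
class FinishAction:
    summary: str = ""

@dataclass
class MessageEvent:
    content: str = ""

FAKE_USER_MESSAGE = "Continue working on the task; call the finish tool when you are done."

def handle_agent_event(event, state: dict, send_user_message) -> None:
    if isinstance(event, FinishAction):
        # Only an explicit finish marks the conversation as done.
        state["execution_status"] = "FINISHED"
    elif isinstance(event, MessageEvent):
        # A content-only reply does not end the run; nudge the agent to continue.
        send_user_message(FAKE_USER_MESSAGE)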

@juanmichelini (Collaborator, Author) commented:

This issue happens only when benchmarking or testing the models GPT 5 Codex and GLM 4.6, that is, when there is no user to ask the agent to continue.

Those two models behave differently than the Claude family, and the current fix fails the tests for Claude Sonnet 4.
As Engel mentions, there is a related issue that fails with Claude: #1351

(Side note: something else that might be related: unlike GPT 5 Codex, GPT 5 gives patches in most cases, but only in ~20% of cases does it show a FinishAction, which makes GPT 5 more costly to run when doing multiple iterations. See the benchmark sheet and compare the two GPT 5 models.)

Changing the conditions for FinishAction might impact all model evaluations, so I do not think we should merge the fix as is, but we could:

  • Add some LLM-specific logic, either in the agent or in the system prompts, for GPT 5 Codex and GLM 4.6 (a rough sketch follows after this list). The tests would work, and other models would not be affected when benchmarking.

  • Keep the fix but test differently: see if the tests can be modified to work with Claude, and also run SWE-bench with Claude on this fix to make sure there is no regression.

  • Give [BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination #1351 a try, and also check that it fixes the empty patches in GPT 5 Codex and GLM 4.6.

  • In the meantime, we could evaluate GPT 5 Codex and GLM 4.6 with this fix without merging.
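
As mentioned in the first option, a rough sketch of model-specific gating; the model identifiers and matching logic are assumptions chosen only for illustration:

MODELS_THAT_STOP_ON_TEXT = ("gpt-5-codex", "glm-4.6")

def needs_continue_nudge(model_name: str) -> bool:
    """Return True for models known to stop with text instead of tool calls."""
    name = model_name.lower()
    return any(tag in name for tag in MODELS_THAT_STOP_ON_TEXT)

# Example: only these evaluations would get the extra nudge or prompt tweak.
assert needs_continue_nudge("litellm_proxy/gpt-5-codex")
assert not needs_continue_nudge("claude-sonnet-4-5")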

@enyst (Collaborator) commented Dec 11, 2025

I ran this PR change with a hello world-like script on gpt-5-mini:


Message from Agent 

Done — I created /Users/enyst/repos/oh-benchmarks/FACTS.txt with three facts about the project. Contents:

1. This repository provides benchmark evaluation infrastructure for OpenHands agents, including SWE-Bench, GAIA, Commit0, and OpenAgentSafety.
2. The project uses a local git submodule for the OpenHands software-agent-sdk (vendor/software-agent-sdk) and requires running `make build` to set up the environment.
3. Benchmarks support Docker (local) and Remote workspace types, enabling both local evaluations and scalable parallel runs via a runtime API.

Would you like me to commit this file to git?

Reasoning:

Tokens: ↑ input 29.57K • cache hit 85.72% •  reasoning 768 • ↓ output 1.13K • $ 0.0040

[event] MessageEvent

02:18:28 - LiteLLM:INFO: utils.py:1308 - Wrapper: Completed Call, calling success_handler

[12/11/25 02:18:28] WARNING  LLM produced empty response - continuing agent loop    agent.py:262

Message from Agent 

[no text content]

Tokens: ↑ input 39.33K • cache hit 64.45% •  reasoning 768 • ↓ output 1.14K • $ 0.0064

[event] MessageEvent

02:18:29 - LiteLLM:INFO: utils.py:1308 - Wrapper: Completed Call, calling success_handler

[12/11/25 02:18:29] WARNING  LLM produced empty response - continuing agent loop    agent.py:262

Message from Agent 

[no text content]

Tokens: ↑ input 49.08K • cache hit 51.63% •  reasoning 768 • ↓ output 1.14K • $ 0.0089

[event] MessageEvent

[12/11/25 02:18:29] WARNING  Stuck pattern detected.    local_conversation.py:305
Execution status after run: ConversationExecutionStatus.STUCK

LLM telemetry logs: llm_logs
  - llm_logs/openai__gpt-5-mini-1765415886.365-04a6.json
  - llm_logs/openai__gpt-5-mini-1765415889.401-fea7.json
  - llm_logs/openai__gpt-5-mini-1765415900.334-47b6.json
  - llm_logs/openai__gpt-5-mini-1765415906.747-dd0d.json
  - llm_logs/openai__gpt-5-mini-1765415908.070-d94f.json
  - llm_logs/openai__gpt-5-mini-1765415909.144-a709.json

Metrics snapshot: {'model_name': 'openai/gpt-5-mini', 'accumulated_cost': 0.00885035, 'max_budget_per_task': None, 'accumulated_token_usage': {'model': 'openai/gpt-5-mini', 'prompt_tokens': 49083, 
'completion_tokens': 1141, 'cache_read_tokens': 25344, 'cache_write_tokens': 0, 'reasoning_tokens': 768, 'context_window': 272000, 'per_turn_token': 9762, 'response_id': ''}}

@enyst (Collaborator) commented Dec 11, 2025

@juanmichelini Thank you for the details, now I understand better why you were linking fake_user_message to a lot of evals.

But... I think we need to fix this... I just don't see an alternative. I think this is a bug, so it's not optional 😅

I mean, IMHO, this PR is still incomplete for running evals, because:

  • when the LLM responds with text, not tool calls
  • the SDK sets the conversation to FINISHED (this PR fixes this)
  • the benchmark code, afaik, will do nothing? If it does nothing, as you see above, the LLM can get STUCK. 😓

@enyst (Collaborator) commented Dec 11, 2025

@juanmichelini Please allow me to note: I looked into this and had the agent reproduce it, and I think you're correct that this happens only because we're missing a user message to tell the LLM to continue.

The answer to the question:
"how often does an LLM respond with text when we would like tool calls?"

is LLM dependent:

  • it seems GPT-5 does it towards the end of a task
  • GPT-5-codex variants do that at the beginning of a task, too
    • I think the LLM asks the user if they like the plan (after it used task_tracking tool)
    • it doesn't want to run wild, before the user approves the plan 😅
    • which is why we see some empty patches (it didn't start working yet)
  • GLM (maybe, not sure)
  • Sonnet does it very rarely these days, but it does it too:
    • we do know that Sonnet calls tools super-easily, and does stuff with tool calls all the time, without stopping to ask the user what to do.
    • this can actually be undesirable behavior and we want to test for it, because Sonnet might start calling editing tools and "solve" a task even when the user asked a question, not for a task
    • however, to note, Sonnet has asked the user in the past; in particular, I remember we had a lot of eval trajectories with MessageActions back when we had defined FinishAction with a partial parameter. Then Sonnet didn't like the FinishAction either, and talked to the user instead: 🤷 😂
    • [Bug]: The LLM doesn't understand the Finish tool very well OpenHands#6406
    • same for issue 1351 linked above.

IMHO, we simply need to port into benchmarks repo the fake user message code.

I don't think we can avoid it; otherwise the runs stop earlier than normal, which likely means worse results than they could otherwise achieve... 🤔 For any LLM.

(On a side note, I also think we will need to account for the behavior change in this PR in the CLI headless mode too).


Please note:
It seems to me that if we run evals on this PR without fixing it in benchmarks, LiteLLM might interfere with the results: for example, in this benchmarks issue, I see:

'modify_params': True

In this case, LiteLLM will send "continue" from the user to the LLM API, in certain circumstances. Please see details here:

In the past, we had modify_params set to False in all evals, so that we have reproducibility and full transparency in what messages we send to the LLM, without interference. I'd suggest maybe we could fix this too.
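
For reference, a minimal sketch of pinning this off for eval runs. LiteLLM does expose a module-level modify_params flag; whether the benchmarks config forwards it this way is an assumption.

import litellm

# With modify_params enabled, LiteLLM may inject a user "continue"-style message
# for providers that reject a conversation ending on an assistant turn, which
# would mask exactly the behavior this PR is trying to study.
litellm.modify_params = False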

@enyst (Collaborator) commented Dec 11, 2025

test

How about:

  • a script to reproduce, without the Finish tool, such as a hello world example
  • running examples from the examples/ directory (we have a label defined, run examples)
  • running the integration tests (also a label)
  • all without the Finish tool, for simplicity
  • (and maybe with it, too)

Without the tool, the LLM has to respond to the user in plain text, which lets us see the behavior of the SDK in that case.

@juanmichelini (Collaborator, Author) commented:

hey @enyst !

the benchmark code, afaik, will do nothing? If it does nothing, as you see above, the LLM can get STUCK. 😓

I think you are right. This fix improves how many patches we get with Codex from ~20% to ~80%, but that is still far from the usual ~98% or so.

IMHO, we simply need to port into benchmarks repo the fake user message code.

By "port", do you suggest we close this PR without merging and only do a similar PR on the benchmarks repo?

  • That is, the benchmark logic would detect that the agent finished, check whether the reason for finishing is valid, and if not, send a fake user message like "continue" (a rough sketch follows below).
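
A hedged sketch of that benchmark-side loop; the conversation attributes and event shape used here are assumptions for illustration, not a confirmed interface:

FAKE_USER_MESSAGE = "Please continue. Do not stop until the task is fully solved."
MAX_CONTINUES = 10  # hard cap so a chatty model cannot loop forever

def run_benchmark_instance(conversation) -> None:
    for _ in range(MAX_CONTINUES):
        conversation.run()
        last = conversation.events[-1]
        if getattr(last, "kind", None) == "finish":
            return  # valid finish: the agent explicitly declared it is done
        # Otherwise the agent only produced a message; ask it to continue.
        conversation.send_message(FAKE_USER_MESSAGE)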

@enyst (Collaborator) left a comment:

Just for clarity: IMHO this is important to fix, and I think we need to do it in coordination with a benchmarks repo fix. That seems to be the case regardless of what the fix is, probably.

IMHO Xingyao's suggestion in this thread is the right fix, but even if the fix is different, I'd love to have some real-world LLM test (maybe an integration test, or maybe changes to examples/)... it is a bug.

@juanmichelini (Collaborator, Author) commented:

OK, I'll create a separate issue for the benchmark repo.

Regarding this PR, we can add tests just for this case, but I'm worried about the tests that are currently failing. I have not found a fix that both fixes this and passes the tests.

@enyst (Collaborator) commented Dec 11, 2025

@OpenHands This CI job is failing https://github.com/OpenHands/software-agent-sdk/actions/runs/19969962009/job/57271889637?pr=1304

Please look into it, but don't fix it yet: instead, tell me for each of those 11 tests, what was the issue. Also, analyze and tell me if each test of those 11 would still be failing and why, if instead of setting any conversation without tool calls to FINISHED, we set it to IDLE.

openhands-ai bot commented Dec 11, 2025

I'm on it! enyst can track my progress at all-hands.dev

openhands-ai bot commented Dec 11, 2025

Summary of findings for the failing CI job on PR #1304

What I did

  • Retrieved and parsed the GitHub Actions job logs for the linked run.
  • Identified the 11 failing tests and inspected the assertions and emitted logs.
  • Analyzed whether each would still fail if content-only assistant responses were set to IDLE instead of FINISHED.

Root cause overview

  • All failures stem from content-only assistant responses no longer being treated as terminal. Without the early FINISHED, run() continues, triggers stuck detection (“Stuck pattern detected.”), and leaves the conversation in STUCK. In one case, this also led to extra LLM calls (3 vs expected 1).
  • Changing content-only to IDLE wouldn’t satisfy the tests either, because every affected test explicitly expects FINISHED.

Per-test diagnosis and “IDLE instead of FINISHED” analysis

  1. tests/sdk/agent/test_reasoning_only_responses.py::test_agent_finishes_after_content_only_response
    • CI failure: llm._call_count was 3 (expected 1) and status ended up STUCK instead of FINISHED. The loop didn't terminate on content and got stuck.
    • If content-only → IDLE: The FINISHED assertion would still fail (though the call count would likely become 1).
  2. tests/sdk/conversation/local/test_agent_status_transition.py::test_execution_status_transitions_to_running_from_idle
    • CI failure: Expected FINISHED; got STUCK after a content-only message.
    • If content-only → IDLE: Still fails because the test asserts FINISHED.
  3. tests/sdk/conversation/local/test_agent_status_transition.py::test_execution_status_is_running_during_execution_from_idle
    • CI failure: Tool call then content-only; expected final FINISHED; got STUCK.
    • If content-only → IDLE: Still fails due to expecting FINISHED.
  4. tests/sdk/conversation/local/test_agent_status_transition.py::test_execution_status_transitions_to_running_from_paused
    • CI failure: Expected FINISHED; got STUCK after content-only final message.
    • If content-only → IDLE: Still fails (expects FINISHED).
  5. tests/sdk/conversation/local/test_agent_status_transition.py::test_execution_status_transitions_from_waiting_for_confirmation
    • CI failure: After confirming, final step was content-only; expected FINISHED; got STUCK.
    • If content-only → IDLE: Still fails (expects FINISHED).
  6. tests/sdk/conversation/local/test_agent_status_transition.py::test_execution_status_finished_to_idle_to_running
    • CI failure: First run expected FINISHED; got STUCK due to content-only.
    • If content-only → IDLE: Still fails at the first FINISHED assertion.
  7. tests/sdk/conversation/local/test_agent_status_transition.py::test_run_exits_immediately_when_already_finished
    • CI failure: First run didn't reach FINISHED (ended STUCK), breaking the "second run exits immediately without LLM call" guarantee.
    • If content-only → IDLE: Still fails because it requires the first run to reach FINISHED.
  8. tests/sdk/conversation/local/test_confirmation_mode.py::TestConfirmationMode::test_message_only_in_confirmation_mode_does_not_wait
    • CI failure: In AlwaysConfirm mode, a pure assistant message should not wait and should FINISH; got STUCK.
    • If content-only → IDLE: Still fails due to FINISHED expectation.
  9. tests/sdk/conversation/local/test_confirmation_mode.py::TestConfirmationMode::test_action_then_confirm_or_reject[False]
    • CI failure: After confirming, expected FINISHED; got STUCK (final content-only).
    • If content-only → IDLE: Still fails due to FINISHED expectation.
  10. tests/sdk/conversation/local/test_conversation_pause_functionality.py::TestPauseFunctionality::test_pause_during_normal_execution
    • CI failure: After pause->run, expected FINISHED; got STUCK due to content-only not terminating.
    • If content-only → IDLE: Still fails due to FINISHED expectation.
  11. tests/sdk/conversation/local/test_conversation_pause_functionality.py::TestPauseFunctionality::test_resume_paused_agent
    • CI failure: Same pattern; expected FINISHED; got STUCK.
    • If content-only → IDLE: Still fails due to FINISHED expectation.

Bottom line

  • The 11 failures are all caused by content-only assistant responses no longer ending the run loop. Under a policy of setting such responses to IDLE, all 11 tests would still fail because each asserts FINISHED as the final status. The only improvement would be avoiding STUCK and, in at least one case, reducing extra LLM calls, but the tests would still not pass due to their explicit FINISHED assertions.

View full conversation

@enyst (Collaborator) commented Dec 11, 2025

OpenHands-GPT-5:

Recommendation: set IDLE and stop the run loop

  • Use IDLE when the LLM returns a plain text reply (no tool calls), and immediately break out of BaseConversation.run(). This yields control back to the user after emitting the MessageEvent.
  • Do not use PAUSED here. PAUSED is user-initiated (or “defer” in confirmation mode) and implies the user should /resume; it would be misleading for a simple assistant message.
  • Do not use FINISHED here if your intent is “not done with a task, just giving the user a response.” FINISHED means “task complete” and makes subsequent run() a no-op until a new user message resets to IDLE.

Why IDLE fits best

  • Semantics: “Ready for next input.” A text-only assistant message typically ends a single agent step and hands the turn back to the user.
  • SDK behavior: The assistant message is already surfaced as a MessageEvent, so the user sees the reply. If we set IDLE and make run() break on IDLE, we avoid extra LLM calls and stuck detection.
  • CLI behavior: The CLI doesn’t need special handling for IDLE. After run() returns, the prompt is shown. The user can type a follow-up or a command. Using PAUSED would incorrectly require /resume; using FINISHED would signal “task done,” which you’re trying to avoid.

What needs changing

  • In Agent.step, after emitting a MessageEvent for content-only responses, set state.execution_status = IDLE.
  • In BaseConversation.run, treat IDLE as a terminal condition for the current run (break). Today it breaks on FINISHED, PAUSED, and STUCK. Add IDLE to that break set when it’s the post-step result.
  • Update tests that currently assert FINISHED in content-only cases to expect IDLE.
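
A hedged sketch of that change; the class and attribute names mirror the discussion (execution_status, the run loop) but are assumptions, not the actual SDK code:

from enum import Enum, auto

class ExecutionStatus(Enum):
    IDLE = auto()
    RUNNING = auto()
    FINISHED = auto()
    PAUSED = auto()
    STUCK = auto()

class Conversation:
    def __init__(self, agent) -> None:
        self.agent = agent
        self.execution_status = ExecutionStatus.IDLE

    def run(self) -> None:
        self.execution_status = ExecutionStatus.RUNNING
        while True:
            had_tool_calls = self.agent.step(self)  # step emits its own events
            if not had_tool_calls:
                # Text-only reply: hand the turn back to the user instead of
                # re-calling the LLM (which previously ended in STUCK) or
                # marking the task FINISHED prematurely.
                self.execution_status = ExecutionStatus.IDLE
                break
            if self.execution_status in (
                ExecutionStatus.FINISHED,
                ExecutionStatus.PAUSED,
                ExecutionStatus.STUCK,
            ):
                break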

Trade-offs and edge cases

  • If you want “auto-continue” after a content-only reply (ask the model again without user input), then IDLE+break won’t do that; but auto-continue is what caused repeated LLM calls and STUCK. The safer default is to yield to the user.
  • If you later want a “continue” UX without new user text, the CLI could add a “/continue” that simply calls conversation.run() again from IDLE. For now, prompting the user is reasonable.

Summary

  • Set IDLE on text-only assistant replies and stop the run loop. This avoids premature FINISHED, avoids looping into STUCK, and cleanly returns control to the user with the model’s message already displayed.
