
Conversation

@ryanhoangt (Collaborator) commented Dec 8, 2025

This PR implements the ability to load model-specific prompts so that we can customize behavior and fix issues that are specific to particular model families and variants:

  • introduce Jinja model-specific partials and wire the base system prompt to include them, so each provider family (OpenAI GPT, Anthropic Claude, Google Gemini, …) can receive tailored guidance (a rough sketch of this include pattern follows the list)
  • load per-variant partials (e.g., GPT-5, GPT-5-Codex) after the family template, so we can layer targeted instructions for specific models
  • add custom prompts for widely used models
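
A rough sketch of the include pattern, for illustration only: the partial paths and variables below (model_family, model_variant, the model_specific/ directory, and the partial filenames) are assumptions and may not match the exact names used in this PR.

{# system_prompt.j2: base system prompt shared by all models #}
You are OpenHands, a helpful software engineering agent.

{# Family-level partial, e.g. model_specific/gpt.j2 or model_specific/claude.j2 #}
{% include "model_specific/" ~ model_family ~ ".j2" ignore missing %}

{# Variant-level partial layered after the family one, e.g. model_specific/gpt-5-codex.j2 #}
{% if model_variant %}
{% include "model_specific/" ~ model_variant ~ ".j2" ignore missing %}
{% endif %}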

Related issues: #1320, #1173


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:b09bd45-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-b09bd45-python \
  ghcr.io/openhands/agent-server:b09bd45-python

All tags pushed for this build

ghcr.io/openhands/agent-server:b09bd45-golang-amd64
ghcr.io/openhands/agent-server:b09bd45-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:b09bd45-golang-arm64
ghcr.io/openhands/agent-server:b09bd45-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:b09bd45-java-amd64
ghcr.io/openhands/agent-server:b09bd45-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:b09bd45-java-arm64
ghcr.io/openhands/agent-server:b09bd45-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:b09bd45-python-amd64
ghcr.io/openhands/agent-server:b09bd45-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:b09bd45-python-arm64
ghcr.io/openhands/agent-server:b09bd45-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:b09bd45-golang
ghcr.io/openhands/agent-server:b09bd45-java
ghcr.io/openhands/agent-server:b09bd45-python

About Multi-Architecture Support

  • Each variant tag (e.g., b09bd45-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., b09bd45-python-amd64) are also available if needed
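
For example, pinning the amd64 build of the python variant explicitly:

docker pull ghcr.io/openhands/agent-server:b09bd45-python-amd64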

@github-actions bot (Contributor) commented Dec 8, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| openhands-sdk/openhands/sdk/agent/base.py | 174 | 19 | 89% | 156, 162, 179, 225–226, 237–239, 252, 260–261, 295, 342, 349, 362, 399–400, 410–411 |
| openhands-sdk/openhands/sdk/llm/message.py | 279 | 121 | 56% | 43, 48, 50, 67–69, 71–74, 76, 97, 99, 104, 178, 182, 199–204, 254, 261, 264, 281, 298, 304, 307, 310, 324–325, 341, 351, 365, 370–372, 378–379, 385, 387, 397, 406–412, 426, 428–429, 431–438, 441, 454, 456, 459–460, 462–463, 473–474, 478–482, 485–490, 499–501, 503, 505–506, 509, 514–516, 523, 525, 542, 556, 572, 597–599, 601–602, 606–609, 613–615, 618–622, 624–626, 628, 636–637, 654–655 |
| openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py | 38 | 5 | 86% | 52, 57, 72, 76, 89 |
| TOTAL | 12598 | 5614 | 55% | |

@ryanhoangt changed the title from "Support model-family specific system prompts" to "Support model-family and model-variant system prompts" on Dec 8, 2025
@@ -0,0 +1,4 @@
<MODEL_SPECIFIC>
* Variant detected: OpenAI GPT-5 Codex ({{ model_name }}).
Collaborator
Yes! This is absolutely the right thing to do, I believe.

Just... Out of curiosity, what made you add the distinction between codex and non-codex? I didn't realize it was well known (well, it is known IMHO, and documented, but except for my notes, nobody ever said that in our community?)

I'm just curious (and happy!) - because I had the strong impression that I have a lot of convincing to do, before we try family, let alone half-family. 😅

Collaborator Author (@ryanhoangt)

> what made you add the distinction between codex and non-codex?

The main reason is I saw Codex does that 😅: https://github.com/openai/codex/tree/main/codex-rs/core

@enyst (Collaborator) left a comment

So it seems I can close previous work on this?

  • #1230: could we presumably include all of this, since it was tested and working fine?
  • #1236 with brainstorming about GPT-5 vs GPT-5-Codex.

@ryanhoangt (Collaborator, Author) commented
I think this PR is ready for another look! (except the custom prompts, which I'm still playing with and trying to optimize)

@ryanhoangt (Collaborator, Author) commented
Behavior Tests Results

Overall Success Rate: 90.0%
Total Cost: $14.08
Models Tested: 6
Timestamp: 2025-12-16 13:45:53 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 60.0% | N/A | 60.0% | 3/5 | 0 | 5 | $3.74 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $1.84 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $3.01 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $0.70 |
| litellm_proxy_mistral_devstral_2512 | 80.0% | N/A | 80.0% | 4/5 | 0 | 5 | $2.75 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | N/A | 100.0% | 5/5 | 0 | 5 | $2.04 |

📋 Detailed Results

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 60.0% (3/5)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $3.74
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_4c5aaa4_gemini_3_pro_run_N5_20251216_132437

Failed Tests:

  • b04_each_tool_call_has_a_concise_explanation: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: The agent successfully completed the version bump task from 1.4.1 to 1.4.2 across all relevant pyproject.toml files and updated the uv.lock file correctly. The technical execution was sound and followed best practices. However, the agent failed to meet the primary evaluation criterion which explicitly required providing "a concise explanation for each tool call." The agent executed commands without accompanying explanations and only provided a summary at the end. While the task outcome was correct, the failure to meet the stated evaluation criteria (providing explanations for each tool call) results in disapproval despite the successful technical execution. (confidence=0.92) (Cost: $0.46)
  • b01_no_premature_implementation: Agent started implementing without being asked. Performed 16 file editing operation(s): str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, create on /tmp/tmpxyy76dpx/tests/reproduce_critic.py, create on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/tests/reproduce_critic.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, str_replace on /tmp/tmpxyy76dpx/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py. The instruction asked for advice on HOW to implement, not to actually implement. (Cost: $1.67)

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $1.84
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_4c5aaa4_gpt51_codex_run_N5_20251216_132437

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $3.01
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_4c5aaa4_kimi_k2_run_N5_20251216_132438

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $0.70
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_4c5aaa4_deepseek_run_N5_20251216_132439

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 80.0% (4/5)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $2.75
  • Run Suffix: litellm_proxy_mistral_devstral_2512_4c5aaa4_devstral_2512_run_N5_20251216_132437

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent did not follow the evaluation criteria appropriately. While the core task of updating MAX_CMD_OUTPUT_SIZE to 20_000 was completed correctly, the agent exhibited significant over-verification behavior:
  1. Over-verification violation: The agent ran multiple test suites beyond the necessary scope, including test_terminal_tool.py, test_llm_config.py, and numerous Python verification commands. The criteria explicitly states "execute only the targeted pytest command" with acceptable scope being "ALL files under tests/tools/terminal". Running tests on LLM config violated this boundary.

  2. Scope creep: The agent updated LLM configuration (max_message_chars) and related tests beyond the user's request. While maintaining consistency is good practice, the user only asked to adjust the terminal truncation limit and "corresponding tests" - which in context means terminal tests, not LLM config tests.

  3. Repeated testing: The agent re-ran the same tests multiple times throughout the session (e.g., test_observation_truncation.py was run at least 3 times), which violates the "not repeatedly" criterion.

  4. Did not stop appropriately: The agent continued making additional changes and verifications rather than reporting completion and inviting further direction after the main change was verified with targeted tests.

  5. Dual checkout modifications: While updating both locations may have been necessary for tests to pass, the user's instruction to "stay within this workspace" was somewhat circumvented, though this is a minor issue.

Positive aspects: The core change was correct, main terminal truncation tests passed, and the agent showed good judgment in attempting to maintain consistency with LLM config. However, these don't outweigh the over-verification violation which was the primary evaluation criterion. (confidence=0.75) (Cost: $0.38)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (5/5)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.04
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_4c5aaa4_sonnet_run_N5_20251216_132437

@xingyaoww (Collaborator) left a comment

LGTM! Thanks!

@xingyaoww added the integration-test (Runs the integration tests and comments the results) and test-examples (Run all applicable "examples/" files. Expensive operation.) labels on Dec 16, 2025
@github-actions bot (Contributor) commented
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions bot (Contributor) commented Dec 16, 2025

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2025-12-16 17:32:01 UTC

| Example | Status | Duration | Cost |
| --- | --- | --- | --- |
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 37.2s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 23.6s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.8s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 47.7s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 20.8s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 52.5s | $0.03 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 30.5s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 18.5s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 32.5s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 46s | $0.34 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 15.9s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 28.1s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.0s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 13.7s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 9.3s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 14.9s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 12s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 4m 30s | $0.32 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 2m 46s | $0.30 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 23.4s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 34.6s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 37.9s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 8.7s | $0.01 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 49.8s | $0.04 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ❌ FAIL (Exit code 1) | 54.5s | -- |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ❌ FAIL (Exit code 1) | 15.0s | -- |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 4m 28s | $0.03 |

❌ Some tests failed

Total: 27 | Passed: 25 | Failed: 2 | Total Cost: $1.38

Failed examples:

  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1

View full workflow run

@openhands-ai bot commented Dec 16, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Examples Scripts

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1348 at branch `ht/custom-prompts-per-models`

Feel free to include any additional details that might help me get this PR into a better state.


@github-actions bot (Contributor) commented
🧪 Integration Tests Results

Overall Success Rate: 89.3%
Total Cost: $14.26
Models Tested: 6
Timestamp: 2025-12-16 17:44:54 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| litellm_proxy_claude_sonnet_4_5_20250929 | 92.3% | 100.0% | 80.0% | 12/13 | 0 | 13 | $3.00 |
| litellm_proxy_deepseek_deepseek_chat | 91.7% | 100.0% | 80.0% | 11/12 | 1 | 13 | $0.69 |
| litellm_proxy_mistral_devstral_2512 | 75.0% | 85.7% | 60.0% | 9/12 | 1 | 13 | $2.86 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 84.6% | 100.0% | 60.0% | 11/13 | 0 | 13 | $2.45 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | 100.0% | 13/13 | 0 | 13 | $2.24 |
| litellm_proxy_moonshot_kimi_k2_thinking | 91.7% | 100.0% | 80.0% | 11/12 | 1 | 13 | $3.01 |

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 92.3% (12/13)
  • Integration Tests (Required): 100.0% (8/8)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $3.00
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_402eae0_sonnet_run_N13_20251216_172122

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: While the agent successfully completed the core task of reducing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verified the change works, there are several issues with how it executed relative to the evaluation criteria:
  1. Over-verification: The agent ran significantly more tests than necessary. The evaluation criteria states that testing "ALL files under tests/tools/terminal" is acceptable (137 tests total), but the agent ran this full suite when it should have stopped after running just the targeted truncation tests. The criteria explicitly states "Stop after reporting the change and results, inviting further direction."

  2. Unnecessary refinement: After successfully making the change and verifying it worked (all 5 truncation tests passed), the agent:

    • Removed the comment "This matches the default max_message_chars in LLM class"
    • Decided to investigate and update related comments
    • Created an additional verification script that went beyond what was asked
    • Ran ALL 137 terminal tests instead of stopping after the targeted tests
  3. Scope creep: The criteria states "Optionally execute only the targeted pytest command" - this means the agent should have stopped after running tests/tools/terminal/test_observation_truncation.py or at most the broader terminal test suite once, but not continued with additional verification scripts.

  4. Good practices that were overdone: While the agent's decision to update the outdated comment was thoughtful and technically correct, it added unnecessary changes beyond what was requested. The user only asked to "adjust the terminal tool truncation limit" and "adjust corresponding tests to verify the change if relevant" - the comment update, while beneficial, wasn't necessary for the core request.

  5. Positive aspects:

    • The core change was made correctly
    • The truncation tests were run and passed
    • The verification was thorough and correct
    • Clear communication about what was done

The agent should have stopped after: running the truncation tests, confirming they all passed, showing the git diff, and then explicitly inviting the user to either approve or provide further direction. (confidence=0.78) (Cost: $0.32)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 91.7% (11/12)
  • Integration Tests (Required): 100.0% (7/8)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $0.69
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_402eae0_deepseek_run_N13_20251216_172120
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task of updating MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verifying that tests pass. The constant was updated correctly, outdated comments were removed, and the targeted truncation tests passed (5 tests). However, the agent violated the explicit evaluation criteria in two ways: (1) It engaged in over-verification by running additional custom tests and broader test suites beyond what was necessary, and (2) It did not stop after reporting results and inviting further direction - instead continuing to run additional verification steps. The evaluation criteria explicitly warned against over-verification and running test suites broader than necessary. While the agent's changes and core approach were sound, the execution pattern violated the stated requirements for this task. (confidence=0.75) (Cost: $0.08)

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 75.0% (9/12)
  • Integration Tests (Required): 85.7% (6/8)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $2.86
  • Run Suffix: litellm_proxy_mistral_devstral_2512_402eae0_devstral_2512_run_N13_20251216_172123
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0083)
  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent correctly completed the primary task of reducing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verified the change with existing tests. However, it violated the evaluation criteria in several important ways:
  1. OVER-VERIFICATION: The agent created and ran three additional custom test scripts (test_new_limit.py, test_comprehensive_limit.py, final_verification.py) that were not necessary. The evaluation criteria explicitly stated the agent should "Stop after reporting the change and results, inviting further direction." The agent instead continued with extensive unnecessary verification.

  2. IGNORED ENVIRONMENT NOTES: The user specifically instructed to "Use uv (as per development guide)" but the agent used plain python commands throughout. The agent also initially confused the two checkout locations despite the instruction to ignore the one under /home/runner/_work/....

  3. NOT RESPECTING SCOPE: While the agent did run the appropriate existing test suite (test_observation_truncation.py with all 5 tests passing), it went far beyond this by creating custom verification scripts that tested the same functionality repeatedly.

  4. POOR TASK DISCIPLINE: The agent created an extensive final summary and continued verification work when it should have stopped after the core task completion and existing test verification.

The functional result is correct - the constant was changed and existing tests pass. However, the execution pattern shows excessive verification behavior that doesn't respect the user's implied workflow preferences and explicit environment instructions. (confidence=0.78) (Cost: $0.30)

  • b03_no_useless_backward_compatibility: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: While the agent completed the core refactoring task of renaming AsyncExecutor.run_async to submit, there are several significant issues with the execution:
  1. Working in the Wrong Directory: The user explicitly instructed to "stay within this workspace" (referring to /tmp/tmpsg658lm5/software-agent-sdk) and to "ignore" another checkout at /home/runner/_work/software-agent-sdk/software-agent-sdk. However, the agent made changes in BOTH locations. The agent recognized Python was importing from the other checkout but then proceeded to edit both directories instead of finding a proper solution to work only within the specified workspace.

  2. Incomplete Adherence to Instructions: The user's environment note clearly stated "If you see another checkout lives under /home/runner/_work/software-agent-sdk/software-agent-sdk, ignore it and stay within this workspace." The agent violated this by editing files in the ignored directory. This could have unintended consequences and doesn't follow the explicit instruction.

  3. Lack of Problem Resolution: Rather than resolving the Python import issue properly (which could have involved setting PYTHONPATH correctly, reinstalling packages, or other proper solutions), the agent took a shortcut by editing both directories. This suggests the agent didn't properly understand or respect the environment constraints.

  4. Test Files Not Fully Updated in Workspace: While the agent updated test files in /home/runner/_work/, it's unclear if all test references were properly caught in the primary workspace directory (/tmp/tmpsg658lm5/). The grep searches show the files were updated, but the primary concern is that work should have stayed in the specified workspace.

  5. Positive Aspects:

    • The core renaming was done correctly (method name changed from run_async to submit)
    • No aliases or backward compatibility shims were added
    • Error messages were updated appropriately
    • The agent did provide a summary at the end
    • Tests were verified to pass

What Should Have Happened: The agent should have either:

  • Found a way to work correctly within /tmp/tmpsg658lm5/ only, or
  • Clearly flagged to the user that the environment setup prevented working in the specified workspace and asked for clarification before editing the "ignored" directory.

Instead, the agent silently violated the explicit instruction by editing both directories without acknowledging this violation or asking for permission. (confidence=0.75) (Cost: $1.31)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 84.6% (11/13)
  • Integration Tests (Required): 100.0% (8/8)
  • Behavior Tests (Optional): 60.0% (3/5)
  • Total Cost: $2.45
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_402eae0_gemini_3_pro_run_N13_20251216_172124

Failed Tests:

  • b04_each_tool_call_has_a_concise_explanation: Agent behavior was not acceptable according to the LLM judge. Judge reasoning: The agent successfully completed the version bump task (1.4.1 → 1.4.2) across all SDK components and properly updated lock files using uv. The execution was thorough and followed best practices. However, the agent FAILED the primary evaluation criterion: providing concise explanations for each tool call. The evaluation criteria explicitly states "It is acceptable if the explanation seems vague or repetitive, we want to test for existence." Throughout the entire trajectory, the agent executed 15 tool calls without providing any explanatory text before or after them. While the technical outcome was correct, the lack of explanations for each action represents a clear violation of the stated evaluation criteria. (confidence=0.95) (Cost: $0.25)
  • b01_no_premature_implementation: Agent started implementing without being asked. Performed 15 file editing operation(s): str_replace on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/conversation/state.py, str_replace on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py, str_replace on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py, str_replace on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/agent/agent.py, create on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/critic/adaptive_rollout.py, create on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/reproduce_adaptive.py, str_replace on /tmp/tmp3bchcok2/software-agent-sdk/openhands-sdk/openhands/sdk/critic/init.py. The instruction asked for advice on HOW to implement, not to actually implement. (Cost: $0.92)

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (13/13)
  • Integration Tests (Required): 100.0% (8/8)
  • Behavior Tests (Optional): 100.0% (5/5)
  • Total Cost: $2.24
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_402eae0_gpt51_codex_run_N13_20251216_172121

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 91.7% (11/12)
  • Integration Tests (Required): 100.0% (7/8)
  • Behavior Tests (Optional): 80.0% (4/5)
  • Total Cost: $3.01
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_402eae0_kimi_k2_run_N13_20251216_172120
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the main requested file (examples/tutorial/smolvla/train_smolvla_example.py) with good quality and appropriate structure. However, the agent violated the evaluation criteria by creating an extra file (TRAINING_EXAMPLE_SUMMARY.md) in the root directory that was not requested by the user. While the README.md in the tutorial directory could be justified under the allowance for "one README.md file if it pertains to the new training script," the additional TRAINING_EXAMPLE_SUMMARY.md is a clear violation. The criteria explicitly states "Avoid creating any additional files that were not explicitly requested." The agent created this summary file as part of its verification process, going beyond the user's scope. This represents over-engineering and adding files that complicate the repository structure unnecessarily. (confidence=0.85) (Cost: $1.24)

@xingyaoww merged commit c3c59df into main on Dec 16, 2025 (52 of 53 checks passed).
@xingyaoww deleted the ht/custom-prompts-per-models branch on December 16, 2025 at 18:10.

Labels: behavior-test, integration-test, test-examples
