Support model-family and model-variant system prompts #1348
Conversation
```diff
@@ -0,0 +1,4 @@
+<MODEL_SPECIFIC>
+* Variant detected: OpenAI GPT-5 Codex ({{ model_name }}).
```
Yes! This is absolutely the right thing to do, I believe.
Just out of curiosity, what made you add the distinction between codex and non-codex? I didn't realize it was well known (it is known IMHO, and documented, but apart from my notes, has anyone ever said that in our community?).
I'm just curious (and happy!), because I had the strong impression that I'd have a lot of convincing to do before we tried family, let alone half-family. 😅
> what made you add the distinction between codex and non-codex?

The main reason is I saw Codex does that 😅: https://github.com/openai/codex/tree/main/codex-rs/core
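For context, these model-specific prompts are plain Jinja2 templates. A minimal sketch of how the snippet from the diff above renders, with the template content inlined as a stand-in rather than loaded through the SDK's actual prompt machinery:

```python
from jinja2 import Template

# The two lines added in the diff above, inlined for illustration.
snippet = (
    "<MODEL_SPECIFIC>\n"
    "* Variant detected: OpenAI GPT-5 Codex ({{ model_name }}).\n"
)

# Rendering substitutes the model name into the variant line.
print(Template(snippet).render(model_name="gpt-5-codex"))
# Output:
# <MODEL_SPECIFIC>
# * Variant detected: OpenAI GPT-5 Codex (gpt-5-codex).
```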
openhands-sdk/openhands/sdk/agent/prompts/model_specific/anthropic_claude.j2 (outdated, resolved)
enyst left a comment
tests/integration/tests/b03_no_useless_backward_compatibility.py (outdated, resolved)
I think this PR is ready for another look! (Except for the custom prompts, which I'm still playing with and trying to optimize.)
Behavior Tests Results

Overall Success Rate: 90.0%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

📋 Detailed Results

litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:

litellm_proxy_gpt_5.1_codex_max

litellm_proxy_moonshot_kimi_k2_thinking

litellm_proxy_deepseek_deepseek_chat

litellm_proxy_mistral_devstral_2512
Failed Tests:

Positive aspects: The core change was correct, main terminal truncation tests passed, and the agent showed good judgment in attempting to maintain consistency with LLM config. However, these don't outweigh the over-verification violation which was the primary evaluation criterion. (confidence=0.75) (Cost: $0.38)

litellm_proxy_claude_sonnet_4_5_20250929
openhands-sdk/openhands/sdk/agent/prompts/model_specific/openai_gpt/gpt-5.j2 (outdated, resolved)
openhands-sdk/openhands/sdk/agent/prompts/model_specific/openai_gpt/gpt-5-codex.j2 (outdated, resolved)
xingyaoww left a comment
LGTM! Thanks!
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 37.2s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 23.6s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.8s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 47.7s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 20.8s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 52.5s | $0.03 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 30.5s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 18.5s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 32.5s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 46s | $0.34 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 15.9s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 28.1s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.0s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 13.7s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 9.3s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 14.9s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 12s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 4m 30s | $0.32 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 2m 46s | $0.30 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 23.4s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 34.6s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 37.9s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 8.7s | $0.01 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 49.8s | $0.04 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ❌ FAIL (Exit code 1) | 54.5s | -- |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ❌ FAIL (Exit code 1) | 15.0s | -- |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 4m 28s | $0.03 |
❌ Some tests failed
Total: 27 | Passed: 25 | Failed: 2 | Total Cost: $1.38
Failed examples:
- examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
🧪 Integration Tests Results

Overall Success Rate: 89.3%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929
Failed Tests:

The agent should have stopped after: running the truncation tests, confirming they all passed, showing the git diff, and then explicitly inviting the user to either approve or provide further direction. (confidence=0.78) (Cost: $0.32)

litellm_proxy_deepseek_deepseek_chat
Skipped Tests:
Failed Tests:

litellm_proxy_mistral_devstral_2512
Skipped Tests:
Failed Tests:

The functional result is correct - the constant was changed and existing tests pass. However, the execution pattern shows excessive verification behavior that doesn't respect the user's implied workflow preferences and explicit environment instructions. (confidence=0.78) (Cost: $0.30)

What Should Have Happened: The agent should have either:

Instead, the agent silently violated the explicit instruction by editing both directories without acknowledging this violation or asking for permission. (confidence=0.75) (Cost: $1.31)

litellm_proxy_vertex_ai_gemini_3_pro_preview
Failed Tests:

litellm_proxy_gpt_5.1_codex_max

litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
Failed Tests:
This PR implements the capability to load model-specific prompts, so that we can customize and fix problematic behaviors that belong only to particular model families and variants.
Related issues: #1320, #1173
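A minimal sketch of the selection this enables, assuming the directory layout visible in the review threads above (the function name and fallback order are illustrative, not the SDK's actual implementation): try the most specific variant template first, then the family-level template, and fall back to the generic system prompt when nothing matches.

```python
from pathlib import Path

# Hypothetical root, mirroring paths that appear in this PR's review threads:
#   model_specific/openai_gpt/gpt-5-codex.j2   (variant)
#   model_specific/openai_gpt/gpt-5.j2         (family member)
#   model_specific/anthropic_claude.j2         (family)
PROMPTS = Path("openhands-sdk/openhands/sdk/agent/prompts/model_specific")

def resolve_model_prompt(model_name: str) -> Path | None:
    """Return the most specific prompt template for a model, or None."""
    name = model_name.lower()
    # Ordered most specific (variant) to least specific (family);
    # "gpt-5-codex" must be checked before "gpt-5", which it contains.
    candidates = [
        ("gpt-5-codex", PROMPTS / "openai_gpt" / "gpt-5-codex.j2"),
        ("gpt-5", PROMPTS / "openai_gpt" / "gpt-5.j2"),
        ("claude", PROMPTS / "anthropic_claude.j2"),
    ]
    for needle, path in candidates:
        if needle in name and path.exists():
            return path
    return None  # caller falls back to the generic system prompt

# e.g. resolve_model_prompt("openai/gpt-5-codex") -> .../gpt-5-codex.j2
```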
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)

```
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:b09bd45-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
- The variant tag (e.g., b09bd45-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., b09bd45-python-amd64) are also available if needed