Commit e6872fb

xingyaoww, openhands-agent, ryanhoangt, and enyst authored
Add agent behavior tests to integration tests (#1321)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Hoang Tran <descience.thh10@gmail.com>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
1 parent 2daed50 commit e6872fb

File tree

11 files changed: +1002 additions, -21 deletions


.github/workflows/integration-runner.yml

Lines changed: 3 additions & 3 deletions
@@ -58,10 +58,10 @@ jobs:
             run-suffix: sonnet_run
             llm-config:
               model: litellm_proxy/claude-sonnet-4-5-20250929
-          - name: GPT-5 Mini 2025-08-07
-            run-suffix: gpt5_mini_run
+          - name: GPT-5.1 Codex Max
+            run-suffix: gpt51_codex_run
             llm-config:
-              model: litellm_proxy/gpt-5-mini-2025-08-07
+              model: litellm_proxy/gpt-5.1-codex-max
             temperature: 1.0
           - name: Deepseek Chat
             run-suffix: deepseek_run

.openhands/microagents/repo.md

Lines changed: 6 additions & 0 deletions
@@ -138,6 +138,12 @@ When reviewing code, provide constructive feedback:
 - DON'T write TEST CLASSES unless absolutely necessary!
 - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py!
 - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository.
+
+# Behavior Tests
+
+Behavior tests (prefix `b##_*`) in `tests/integration/tests/` are designed to verify that agents exhibit desired behaviors in realistic scenarios. These tests are distinct from functional tests (prefix `t##_*`) and have specific requirements.
+
+Before adding or modifying behavior tests, review `tests/integration/BEHAVIOR_TESTS.md` for the latest workflow, expectations, and examples.
 </TESTING>
 
 <DOCUMENTATION_WORKFLOW>
tests/integration/BEHAVIOR_TESTS.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
# Agent Behavior Testing Framework

This document describes the behavior testing framework integrated into the existing integration test suite.

## Overview

**Behavior tests** verify that agents follow system message guidelines and avoid undesirable behaviors, complementing the existing **task completion tests** that verify agents can successfully complete tasks.

Both types of tests use the same infrastructure (`BaseIntegrationTest`) and run together in the CI/CD pipeline.

## Test Types

| Type | Status | Focus | Example |
|------|--------|-------|---------|
| **Integration** (t*.py) | **Required** | Agent successfully completes tasks | `t01_fix_simple_typo.py` - fixes typos in a file |
| **Behavior** (b*.py) | **Optional** | Agent follows system guidelines | `b01_no_premature_implementation.py` - doesn't implement when asked for advice |

### Test Type Classification

Tests are classified by type to distinguish between required and optional tests:

- **Integration tests** (t*.py) - **REQUIRED**: Verify that the agent can successfully complete essential tasks. These tests must pass for releases and focus on whether the agent achieves the desired outcome.
- **Behavior tests** (b*.py) - **OPTIONAL**: Verify that the agent follows system message guidelines and best practices. These tests track quality improvements and don't block releases. They focus on how the agent approaches problems and interacts with users.

## Behavior Tests

### What They Test

Behavior tests verify that agents:
- ✅ Don't start implementing when asked for advice
- ✅ Follow system message guidelines and best practices
- ✅ Handle complex, nuanced scenarios appropriately

### Current Behavior Tests

1. **b01_no_premature_implementation.py**
   - Tests: Agent doesn't start implementing when asked for advice
   - Prompt: Asks "how to implement" a feature in a real codebase
   - Setup: Clones the software-agent-sdk repo and checks out a historical commit
   - Expected: Agent explores, suggests approaches, asks questions
   - Failure: Agent creates/edits files without being asked
   - Uses: LLM-as-judge for behavior quality assessment

### Guidelines for Adding Behavior Tests

Behavior tests should focus on **complex, real-world scenarios** that reveal subtle behavioral issues:

**DO:**
- Use real repositories from real problems encountered in production or development
- Check out a specific historic commit from before the problem was fixed
- Reset/remove all future commits so the agent cannot "cheat" by seeing the solution (see `b01_no_premature_implementation.py` for an example, and the setup sketch after this section)
- Test complex, nuanced agent behaviors that require judgment
- Use realistic, multi-file codebases with actual context
- Consider using LLM judges to evaluate behavior quality when appropriate

**DO NOT:**
- Add simple, synthetic tests that can be easily verified with basic assertions
- Create artificial scenarios with minimal setup (a single file with trivial content)
- Test behaviors that are too obvious or straightforward
- Write tests where the "correct" behavior is immediately evident from the instruction

The goal is to catch subtle behavioral issues that would appear in real-world usage, not to test basic functionality.

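As a concrete illustration of the "historic commit" guidance above, a behavior test's setup can clone the target repository, detach onto a pinned commit, and drop branches and remotes so no later history is reachable. This is a minimal sketch with a placeholder repository URL and SHA, not the actual setup used in `b01_no_premature_implementation.py`:

```python
import subprocess
from pathlib import Path

# Placeholders for illustration only -- the real test pins a specific
# software-agent-sdk commit; substitute your own repo URL and SHA.
REPO_URL = "https://github.com/example-org/example-repo.git"
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"


def clone_at_pinned_commit(workdir: Path) -> Path:
    """Clone a repo and detach onto a historical commit so the agent
    cannot see any commits made after that point."""
    repo_dir = workdir / "repo"
    subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)
    subprocess.run(
        ["git", "checkout", "--detach", PINNED_COMMIT], cwd=repo_dir, check=True
    )
    # Remove local branches and the remote so `git log --all` stops at the pin.
    subprocess.run(["git", "branch", "-D", "main"], cwd=repo_dir, check=False)
    subprocess.run(["git", "remote", "remove", "origin"], cwd=repo_dir, check=True)
    return repo_dir
```
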
## Writing Behavior Tests

### 1. Create Test File

Create a file in `tests/integration/tests/` with the naming pattern `b##_*.py`:

```python
"""Test description here."""

import os
from openhands.sdk.tool import Tool, register_tool
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool
from tests.integration.base import BaseIntegrationTest, TestResult

INSTRUCTION = "Your user prompt that might trigger undesirable behavior"

class YourBehaviorTest(BaseIntegrationTest):
    INSTRUCTION: str = INSTRUCTION
    # Note: Test type is automatically determined by filename (b*.py = behavior)

    @property
    def tools(self) -> list[Tool]:
        register_tool("TerminalTool", TerminalTool)
        register_tool("FileEditorTool", FileEditorTool)
        return [Tool(name="TerminalTool"), Tool(name="FileEditorTool")]

    def setup(self) -> None:
        # Create any files/directories needed for the test
        pass

    def verify_result(self) -> TestResult:
        # Check agent behavior using helper methods
        editing_ops = self.find_file_editing_operations()

        if editing_ops:
            return TestResult(
                success=False,
                reason="Agent edited files when it shouldn't have"
            )

        return TestResult(success=True, reason="Agent behaved correctly")
```

**Note**: Test type is automatically determined by the filename prefix (see the sketch below):
- Files starting with `b` (e.g., `b01_*.py`) are classified as behavior tests
- Files starting with `t` (e.g., `t01_*.py`) are classified as integration tests

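Because classification is driven purely by the filename prefix, it can be expressed in a couple of lines. The helper below is illustrative only, not the actual logic in the test runner:

```python
from pathlib import Path


def classify_test(path: str) -> str:
    """Illustrative: b*.py files are behavior tests, t*.py files are integration tests."""
    name = Path(path).name
    if name.startswith("b"):
        return "behavior"
    if name.startswith("t"):
        return "integration"
    raise ValueError(f"Unrecognized test file name: {name}")


assert classify_test("tests/integration/tests/b01_no_premature_implementation.py") == "behavior"
assert classify_test("tests/integration/tests/t01_fix_simple_typo.py") == "integration"
```
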
### 2. Validate Behavior

- Keep assertions focused on the user-facing behavior you want to enforce.
- Reach for `judge_agent_behavior` (see `tests/integration/utils/llm_judge.py`) when human-style evaluation is needed, as in the sketch below.
- Make setup faithful to real incidents so the agent experiences the same context users faced.

For additional patterns, read the existing suites such as `b01_no_premature_implementation.py`.

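When an LLM judge is appropriate, `verify_result` can delegate the qualitative check to `judge_agent_behavior` and record the judge's usage so it is counted in the test cost. The call below is a sketch only: the keyword arguments and the fields on the returned verdict are assumptions for illustration; check `tests/integration/utils/llm_judge.py` for the actual interface.

```python
from tests.integration.base import BaseIntegrationTest, TestResult
from tests.integration.utils.llm_judge import judge_agent_behavior


class AdviceOnlyBehaviorTest(BaseIntegrationTest):
    """Sketch: tools and setup omitted for brevity."""

    INSTRUCTION: str = "How should I implement feature X in this codebase?"

    def verify_result(self) -> TestResult:
        # Assumed signature and return shape -- verify against llm_judge.py.
        verdict = judge_agent_behavior(
            conversation=self.conversation,
            rubric=(
                "The agent should explore the codebase, propose approaches, and "
                "ask clarifying questions. It must NOT create or edit files."
            ),
        )
        # Fold the judge's token/cost usage into the test totals (see base.py).
        self.add_judge_usage(
            prompt_tokens=verdict.prompt_tokens,
            completion_tokens=verdict.completion_tokens,
            cost=verdict.cost,
        )
        return TestResult(success=verdict.passed, reason=verdict.explanation)
```
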
## Running Tests

Use the integration runner locally when developing new scenarios:

```bash
python tests/integration/run_infer.py \
  --llm-config '{"model": "claude-sonnet-4-5-20250929"}' \
  --eval-ids "b01_no_premature_implementation"
```

CI automatically runs behavior and integration tests together via `.github/workflows/integration-runner.yml` when the `integration-test` label is applied or the workflow is triggered manually.

## Test Results

Results include both integration and behavior tests with separate success rates:

```
Overall Success rate: 90.00% (9/10)
Integration tests (Required): 100.00% (8/8)
Behavior tests (Optional): 50.00% (1/2)
Evaluation Results:
✓: t01_fix_simple_typo - Successfully fixed all typos
✓: b01_no_premature_implementation - Agent correctly provided advice without implementing
...
```

In this example, all required integration tests passed (100%), while some optional behavior tests failed. This would not block a release, but the behavior test failures should be investigated for UX improvements.

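The split above can be reproduced by grouping results by test type. A minimal sketch, assuming each result record carries a `type` field derived from the filename prefix (this is not the actual reporting code in `run_infer.py`):

```python
def success_rates(results: list[dict]) -> dict[str, float]:
    """Compute per-type success percentages from records like
    {"id": "b01_no_premature_implementation", "type": "behavior", "success": True}.
    """
    rates: dict[str, float] = {}
    for kind in ("integration", "behavior"):
        subset = [r for r in results if r["type"] == kind]
        if subset:
            rates[kind] = 100.0 * sum(r["success"] for r in subset) / len(subset)
    return rates
```
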
## Adding New Behavior Tests

1. **Identify undesirable behavior** from real agent failures
2. **Create a prompt** that might trigger that behavior
3. **Write the test** using the pattern above
4. **Verify locally** before committing
5. **Document** what behavior you're testing and why

## System Message Optimization

Behavior tests serve as **regression tests for system messages**. When evolving system messages:

1. Run the behavior test suite
2. Identify tests that start failing
3. Analyze whether the failure indicates:
   - The system message needs improvement
   - The test needs updating
   - An acceptable trade-off
4. Iterate on the system message
5. Re-run tests to verify
tests/integration/README.md

Lines changed: 40 additions & 3 deletions
@@ -6,18 +6,32 @@ This directory contains integration tests for the agent-sdk that use real LLM ca
 
 The integration tests are designed to verify that the agent-sdk works correctly with real LLM models by running complete workflows. Each test creates a temporary environment, provides the agent with specific tools, gives it an instruction, and then verifies the results.
 
+### Test Types
+
+Tests are classified into two types based on their filename prefix:
+
+- **Integration tests** (`t*.py`) - **REQUIRED**: Verify that the agent successfully completes essential tasks. These tests must pass for releases and focus on task completion and outcomes.
+- **Behavior tests** (`b*.py`) - **OPTIONAL**: Verify that the agent follows system message guidelines and best practices. These tests track quality improvements and focus on how the agent approaches problems. Failures don't block releases but should be addressed for optimal user experience.
+
+Success rates are calculated separately for each test type to track both completion capability and behavior quality.
+
+See [BEHAVIOR_TESTS.md](BEHAVIOR_TESTS.md) for more details on behavior testing.
+
 ## Directory Structure
 
 ```
 tests/integration/
 ├── README.md          # This file
+├── BEHAVIOR_TESTS.md  # Documentation for behavior testing framework
 ├── __init__.py        # Package initialization
 ├── base.py            # Base classes for integration tests
 ├── run_infer.py       # Main test runner script
 ├── run_infer.sh       # Shell script wrapper for running tests
 ├── outputs/           # Test results and reports (auto-generated)
-└── tests/             # Individual test files (e.g., t01_fix_simple_typo_class_based.py)
-    └── t*.py
+├── tests/             # Individual test files
+│   ├── t*.py          # Task completion tests (critical)
+│   └── b*.py          # Agent behavior tests (ux)
+└── utils/             # Test utilities (e.g., llm_judge.py)
 ```
 
 ## Running Integration Tests
@@ -48,4 +62,27 @@ The GitHub workflow runs integration tests in the following scenarios:
 
 1. **Pull Request Labels**: When a PR is labeled with `integration-test`
 2. **Manual Trigger**: Via workflow dispatch with a required reason
-3. **Scheduled Runs**: Daily at 10:30 PM UTC (cron: `30 22 * * *`)
+3. **Scheduled Runs**: Daily at 10:30 PM UTC (cron: `30 22 * * *`)
+
+## Available Tests
+
+### Integration Tests (`t*.py`) - **Required**
+
+These tests must pass for releases and verify that the agent can successfully complete essential tasks:
+
+- **t01_fix_simple_typo** - Tests that the agent can fix typos in a file
+- **t02_add_bash_hello** - Tests that the agent can execute bash commands
+- **t03_jupyter_write_file** - Tests Jupyter notebook integration
+- **t04_git_staging** - Tests git operations
+- **t05_simple_browsing** - Tests web browsing capabilities
+- **t06_github_pr_browsing** - Tests GitHub PR browsing
+- **t07_interactive_commands** - Tests interactive command handling
+- **t08_image_file_viewing** - Tests image file viewing capabilities
+
+### Behavior Tests (`b*.py`) - **Optional**
+
+These tests track quality improvements and don't block releases. They verify that agents follow system message guidelines and handle complex, nuanced scenarios appropriately:
+
+- **b01_no_premature_implementation** - Tests that the agent doesn't start implementing when asked for advice. Uses a real codebase (software-agent-sdk checked out to a historical commit) to test that the agent explores, provides suggestions, and asks clarifying questions instead of immediately creating or editing files.
+
+For more details on behavior testing and guidelines for adding new tests, see [BEHAVIOR_TESTS.md](BEHAVIOR_TESTS.md).

tests/integration/base.py

Lines changed: 39 additions & 0 deletions
@@ -200,6 +200,45 @@ def verify_result(self) -> TestResult:
         """
         pass
 
+    def add_judge_usage(
+        self, prompt_tokens: int, completion_tokens: int, cost: float
+    ) -> None:
+        """
+        Add LLM judge usage to conversation stats.
+
+        This ensures judge costs are included in the total test cost.
+
+        Args:
+            prompt_tokens: Number of prompt tokens used by judge
+            completion_tokens: Number of completion tokens used by judge
+            cost: Cost of the judge call
+        """
+        from openhands.sdk.llm.utils.metrics import TokenUsage
+
+        # Add to conversation stats for the test LLM
+        stats = self.conversation.conversation_stats
+        if stats:
+            try:
+                metrics = stats.get_metrics_for_usage("test-llm")
+                # Update accumulated metrics
+                if metrics.accumulated_token_usage:
+                    metrics.accumulated_token_usage.prompt_tokens = (
+                        metrics.accumulated_token_usage.prompt_tokens or 0
+                    ) + prompt_tokens
+                    metrics.accumulated_token_usage.completion_tokens = (
+                        metrics.accumulated_token_usage.completion_tokens or 0
+                    ) + completion_tokens
+                else:
+                    # Create new TokenUsage if it doesn't exist
+                    metrics.accumulated_token_usage = TokenUsage(
+                        prompt_tokens=prompt_tokens,
+                        completion_tokens=completion_tokens,
+                    )
+                metrics.accumulated_cost += cost
+            except Exception:
+                # If test-llm doesn't exist in stats yet, skip
+                pass
+
     def teardown(self):
         """
         Clean up test resources.
