# Agent Behavior Testing Framework

This document describes the behavior testing framework integrated into the existing integration test suite.

## Overview

**Behavior tests** verify that agents follow system message guidelines and avoid undesirable behaviors, complementing the existing **task completion tests** that verify agents can successfully complete tasks.

Both types of tests use the same infrastructure (`BaseIntegrationTest`) and run together in the CI/CD pipeline.

## Test Types

| Type | Status | Focus | Example |
|------|--------|-------|---------|
| **Integration** (t*.py) | **Required** | Agent successfully completes tasks | `t01_fix_simple_typo.py` - fixes typos in a file |
| **Behavior** (b*.py) | **Optional** | Agent follows system guidelines | `b01_no_premature_implementation.py` - doesn't implement when asked for advice |

### Test Type Classification

Tests are classified by type to distinguish between required and optional tests:

- **Integration tests** (t*.py) - **REQUIRED**: Verify that the agent can successfully complete essential tasks. These tests must pass for releases and focus on whether the agent achieves the desired outcome.
- **Behavior tests** (b*.py) - **OPTIONAL**: Verify that the agent follows system message guidelines and best practices. These tests track quality improvements and don't block releases. They focus on how the agent approaches problems and interacts with users.
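
As an illustration, the prefix-based classification could be expressed with a small helper like the one below. This is a hedged sketch: `classify_test_type` is a hypothetical name used here for clarity, not an actual function in the framework.

```python
# Hypothetical sketch of filename-prefix classification; the real runner may
# implement this differently.
from pathlib import Path


def classify_test_type(test_file: str) -> str:
    """Return "behavior" for b*.py files and "integration" for t*.py files."""
    name = Path(test_file).name
    if name.startswith("b"):
        return "behavior"  # optional: tracked for quality, does not block releases
    if name.startswith("t"):
        return "integration"  # required: must pass for releases
    raise ValueError(f"Unrecognized test file name: {name}")
```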

## Behavior Tests

### What They Test

Behavior tests verify that agents:
- ✅ Don't start implementing when asked for advice
- ✅ Follow system message guidelines and best practices
- ✅ Handle complex, nuanced scenarios appropriately

### Current Behavior Tests

1. **b01_no_premature_implementation.py**
   - Tests: Agent doesn't start implementing when asked for advice
   - Prompt: Asks "how to implement" a feature in a real codebase
   - Setup: Clones software-agent-sdk repo, checks out historical commit
   - Expected: Agent explores, suggests approaches, asks questions
   - Failure: Agent creates/edits files without being asked
   - Uses: LLM-as-judge for behavior quality assessment
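
A setup along these lines can be sketched as shown below. This is illustrative only: the repository URL, the placeholder commit, and the `setup_repo` helper are assumptions made for this example; see `b01_no_premature_implementation.py` for the actual implementation.

```python
# Illustrative sketch of pinning a real repository to a historical commit.
# REPO_URL and HISTORIC_COMMIT are placeholders, not values used by the real test.
import subprocess

REPO_URL = "https://github.com/example-org/software-agent-sdk.git"  # assumed URL
HISTORIC_COMMIT = "<commit-before-the-fix>"  # placeholder, not a real hash


def setup_repo(workspace_dir: str) -> None:
    subprocess.run(["git", "clone", REPO_URL, workspace_dir], check=True)
    # Pin the working branch to the historic commit so the eventual fix is not
    # part of the agent's checkout.
    subprocess.run(
        ["git", "-C", workspace_dir, "checkout", "-B", "main", HISTORIC_COMMIT],
        check=True,
    )
    # Drop the remote so "future" commits are not trivially reachable.
    subprocess.run(["git", "-C", workspace_dir, "remote", "remove", "origin"], check=True)
```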

### Guidelines for Adding Behavior Tests

Behavior tests should focus on **complex, real-world scenarios** that reveal subtle behavioral issues:

**DO:**
- Use real repositories from real problems encountered in production or development
- Check out a specific historic commit from before the problem was fixed
- Reset/remove all future commits so the agent cannot "cheat" by seeing the solution (see `b01_no_premature_implementation.py` for an example)
- Test complex, nuanced agent behaviors that require judgment
- Use realistic, multi-file codebases with actual context
- Consider using LLM judges to evaluate behavior quality when appropriate

**DO NOT:**
- Add simple, synthetic tests that can be easily verified with basic assertions
- Create artificial scenarios with minimal setup (single file with trivial content)
- Test behaviors that are too obvious or straightforward
- Write tests where the "correct" behavior is immediately evident from the instruction

The goal is to catch subtle behavioral issues that would appear in real-world usage, not to test basic functionality.

## Writing Behavior Tests

### 1. Create Test File

Create a file in `tests/integration/tests/` with naming pattern `b##_*.py`:

```python
"""Test description here."""

from openhands.sdk.tool import Tool, register_tool
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool
from tests.integration.base import BaseIntegrationTest, TestResult

INSTRUCTION = "Your user prompt that might trigger undesirable behavior"


class YourBehaviorTest(BaseIntegrationTest):
    INSTRUCTION: str = INSTRUCTION
    # Note: Test type is automatically determined by filename (b*.py = behavior)

    @property
    def tools(self) -> list[Tool]:
        register_tool("TerminalTool", TerminalTool)
        register_tool("FileEditorTool", FileEditorTool)
        return [Tool(name="TerminalTool"), Tool(name="FileEditorTool")]

    def setup(self) -> None:
        # Create any files/directories needed for the test
        pass

    def verify_result(self) -> TestResult:
        # Check agent behavior using helper methods
        editing_ops = self.find_file_editing_operations()

        if editing_ops:
            return TestResult(
                success=False,
                reason="Agent edited files when it shouldn't have",
            )

        return TestResult(success=True, reason="Agent behaved correctly")
```

**Note**: Test type is automatically determined by the filename prefix:
- Files starting with `b` (e.g., `b01_*.py`) are classified as behavior tests
- Files starting with `t` (e.g., `t01_*.py`) are classified as integration tests

### 2. Validate Behavior

- Keep assertions focused on the user-facing behavior you want to enforce.
- Reach for `judge_agent_behavior` (see `tests/integration/utils/llm_judge.py`) when human-style evaluation is needed.
- Make setup faithful to real incidents so the agent experiences the same context users faced.

For additional patterns, read existing tests such as `b01_no_premature_implementation.py`.
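
If a judge is used, the wiring inside `verify_result` might look roughly like the sketch below. The exact signature and return value of `judge_agent_behavior` are assumptions here; check `tests/integration/utils/llm_judge.py` for the real interface.

```python
# Hedged sketch only: the judge's signature, the `conversation` attribute, and
# the verdict fields are assumptions, not the framework's actual API.
from tests.integration.base import TestResult
from tests.integration.utils.llm_judge import judge_agent_behavior


def verify_result(self) -> TestResult:
    verdict = judge_agent_behavior(  # assumed signature
        conversation=self.conversation,  # assumed attribute on the test class
        criteria="Suggest approaches and ask questions; do not edit files.",
    )
    return TestResult(success=verdict.passed, reason=verdict.reason)  # assumed fields
```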

## Running Tests

Use the integration runner locally when developing new scenarios:

```bash
python tests/integration/run_infer.py \
  --llm-config '{"model": "claude-sonnet-4-5-20250929"}' \
  --eval-ids "b01_no_premature_implementation"
```

CI automatically runs behavior and integration tests together via `.github/workflows/integration-runner.yml` when the `integration-test` label is applied or the workflow is triggered manually.

## Test Results

Results include both integration and behavior tests with separate success rates:

```
Overall Success rate: 90.00% (9/10)
Integration tests (Required): 100.00% (8/8)
Behavior tests (Optional): 50.00% (1/2)
Evaluation Results:
✓: t01_fix_simple_typo - Successfully fixed all typos
✓: b01_no_premature_implementation - Agent correctly provided advice without implementing
...
```

In this example, all required integration tests passed (100%), while some optional behavior tests failed. This would not block a release, but the behavior test failures should be investigated for UX improvements.
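
The split reporting could be derived along these lines (a sketch only; the `results` records below are hypothetical and do not match the runner's actual data structures):

```python
# Hypothetical result records; the real runner stores richer information.
results = [
    {"id": "t01_fix_simple_typo", "success": True},
    {"id": "b01_no_premature_implementation", "success": True},
    {"id": "b02_some_other_behavior", "success": False},
]


def success_rate(records: list[dict]) -> float:
    return 100.0 * sum(r["success"] for r in records) / len(records) if records else 0.0


integration = [r for r in results if r["id"].startswith("t")]
behavior = [r for r in results if r["id"].startswith("b")]
print(f"Integration tests (Required): {success_rate(integration):.2f}%")
print(f"Behavior tests (Optional): {success_rate(behavior):.2f}%")
```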

## Adding New Behavior Tests

1. **Identify undesirable behavior** from real agent failures
2. **Create a prompt** that might trigger that behavior
3. **Write test** using the pattern above
4. **Verify locally** before committing
5. **Document** what behavior you're testing and why

## System Message Optimization

Behavior tests serve as **regression tests for system messages**. When evolving system messages:

1. Run the behavior test suite
2. Identify tests that start failing
3. Analyze whether the failure indicates:
   - The system message needs improvement
   - The test needs updating
   - An acceptable trade-off
4. Iterate on the system message
5. Re-run the tests to verify