Commit e6872fb

xingyaoww, openhands-agent, ryanhoangt, and enyst authored
Add agent behavior tests to integration tests (#1321)
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Hoang Tran <descience.thh10@gmail.com>
Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
1 parent 2daed50 commit e6872fb

File tree

11 files changed: +1002 additions, -21 deletions


.github/workflows/integration-runner.yml

Lines changed: 3 additions & 3 deletions
@@ -58,10 +58,10 @@ jobs:
             run-suffix: sonnet_run
             llm-config:
               model: litellm_proxy/claude-sonnet-4-5-20250929
-          - name: GPT-5 Mini 2025-08-07
-            run-suffix: gpt5_mini_run
+          - name: GPT-5.1 Codex Max
+            run-suffix: gpt51_codex_run
             llm-config:
-              model: litellm_proxy/gpt-5-mini-2025-08-07
+              model: litellm_proxy/gpt-5.1-codex-max
             temperature: 1.0
           - name: Deepseek Chat
             run-suffix: deepseek_run

.openhands/microagents/repo.md

Lines changed: 6 additions & 0 deletions
@@ -138,6 +138,12 @@ When reviewing code, provide constructive feedback:
 - DON'T write TEST CLASSES unless absolutely necessary!
 - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py!
 - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository.
+
+# Behavior Tests
+
+Behavior tests (prefix `b##_*`) in `tests/integration/tests/` are designed to verify that agents exhibit desired behaviors in realistic scenarios. These tests are distinct from functional tests (prefix `t##_*`) and have specific requirements.
+
+Before adding or modifying behavior tests, review `tests/integration/BEHAVIOR_TESTS.md` for the latest workflow, expectations, and examples.
 </TESTING>
 
 <DOCUMENTATION_WORKFLOW>
tests/integration/BEHAVIOR_TESTS.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
# Agent Behavior Testing Framework

This document describes the behavior testing framework integrated into the existing integration test suite.

## Overview

**Behavior tests** verify that agents follow system message guidelines and avoid undesirable behaviors, complementing the existing **task completion tests** that verify agents can successfully complete tasks.

Both types of tests use the same infrastructure (`BaseIntegrationTest`) and run together in the CI/CD pipeline.

## Test Types

| Type | Status | Focus | Example |
|------|--------|-------|---------|
| **Integration** (t*.py) | **Required** | Agent successfully completes tasks | `t01_fix_simple_typo.py` - fixes typos in a file |
| **Behavior** (b*.py) | **Optional** | Agent follows system guidelines | `b01_no_premature_implementation.py` - doesn't implement when asked for advice |

### Test Type Classification

Tests are classified by type to distinguish between required and optional tests:

- **Integration tests** (t*.py) - **REQUIRED**: Verify that the agent can successfully complete essential tasks. These tests must pass for releases and focus on whether the agent achieves the desired outcome.
- **Behavior tests** (b*.py) - **OPTIONAL**: Verify that the agent follows system message guidelines and best practices. These tests track quality improvements and don't block releases. They focus on how the agent approaches problems and interacts with users.

## Behavior Tests

### What They Test

Behavior tests verify that agents:
- ✅ Don't start implementing when asked for advice
- ✅ Follow system message guidelines and best practices
- ✅ Handle complex, nuanced scenarios appropriately

### Current Behavior Tests

1. **b01_no_premature_implementation.py**
   - Tests: Agent doesn't start implementing when asked for advice
   - Prompt: Asks "how to implement" a feature in a real codebase
   - Setup: Clones the software-agent-sdk repo and checks out a historical commit
   - Expected: Agent explores, suggests approaches, asks questions
   - Failure: Agent creates/edits files without being asked
   - Uses: LLM-as-judge for behavior quality assessment

### Guidelines for Adding Behavior Tests

Behavior tests should focus on **complex, real-world scenarios** that reveal subtle behavioral issues:

**DO:**
- Use real repositories from real problems encountered in production or development
- Check out a specific historic commit from before the problem was fixed
- Reset/remove all future commits so the agent cannot "cheat" by seeing the solution (see `b01_no_premature_implementation.py` for an example, and the setup sketch after this section)
- Test complex, nuanced agent behaviors that require judgment
- Use realistic, multi-file codebases with actual context
- Consider using LLM judges to evaluate behavior quality when appropriate

**DO NOT:**
- Add simple, synthetic tests that can be easily verified with basic assertions
- Create artificial scenarios with minimal setup (a single file with trivial content)
- Test behaviors that are too obvious or straightforward
- Write tests where the "correct" behavior is immediately evident from the instruction

The goal is to catch subtle behavioral issues that would appear in real-world usage, not to test basic functionality.

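As a concrete illustration of the "historic commit" guidance above, a behavior test's setup can clone the target repository, detach onto a pinned commit, and drop branches and remotes so no later history is reachable. This is a minimal sketch with a placeholder repository URL and SHA, not the actual setup used in `b01_no_premature_implementation.py`:

```python
import subprocess
from pathlib import Path

# Placeholders for illustration only -- the real test pins a specific
# software-agent-sdk commit; substitute your own repo URL and SHA.
REPO_URL = "https://github.com/example-org/example-repo.git"
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"


def clone_at_pinned_commit(workdir: Path) -> Path:
    """Clone a repo and detach onto a historical commit so the agent
    cannot see any commits made after that point."""
    repo_dir = workdir / "repo"
    subprocess.run(["git", "clone", REPO_URL, str(repo_dir)], check=True)
    subprocess.run(
        ["git", "checkout", "--detach", PINNED_COMMIT], cwd=repo_dir, check=True
    )
    # Remove local branches and the remote so `git log --all` stops at the pin.
    subprocess.run(["git", "branch", "-D", "main"], cwd=repo_dir, check=False)
    subprocess.run(["git", "remote", "remove", "origin"], cwd=repo_dir, check=True)
    return repo_dir
```
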
## Writing Behavior Tests

### 1. Create Test File

Create a file in `tests/integration/tests/` with the naming pattern `b##_*.py`:

```python
"""Test description here."""

import os
from openhands.sdk.tool import Tool, register_tool
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool
from tests.integration.base import BaseIntegrationTest, TestResult

INSTRUCTION = "Your user prompt that might trigger undesirable behavior"

class YourBehaviorTest(BaseIntegrationTest):
    INSTRUCTION: str = INSTRUCTION
    # Note: Test type is automatically determined by filename (b*.py = behavior)

    @property
    def tools(self) -> list[Tool]:
        register_tool("TerminalTool", TerminalTool)
        register_tool("FileEditorTool", FileEditorTool)
        return [Tool(name="TerminalTool"), Tool(name="FileEditorTool")]

    def setup(self) -> None:
        # Create any files/directories needed for the test
        pass

    def verify_result(self) -> TestResult:
        # Check agent behavior using helper methods
        editing_ops = self.find_file_editing_operations()

        if editing_ops:
            return TestResult(
                success=False,
                reason="Agent edited files when it shouldn't have"
            )

        return TestResult(success=True, reason="Agent behaved correctly")
```

**Note**: Test type is automatically determined by the filename prefix (see the sketch below):
- Files starting with `b` (e.g., `b01_*.py`) are classified as behavior tests
- Files starting with `t` (e.g., `t01_*.py`) are classified as integration tests

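Because classification is driven purely by the filename prefix, it can be expressed in a couple of lines. The helper below is illustrative only, not the actual logic in the test runner:

```python
from pathlib import Path


def classify_test(path: str) -> str:
    """Illustrative: b*.py files are behavior tests, t*.py files are integration tests."""
    name = Path(path).name
    if name.startswith("b"):
        return "behavior"
    if name.startswith("t"):
        return "integration"
    raise ValueError(f"Unrecognized test file name: {name}")


assert classify_test("tests/integration/tests/b01_no_premature_implementation.py") == "behavior"
assert classify_test("tests/integration/tests/t01_fix_simple_typo.py") == "integration"
```
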
### 2. Validate Behavior

- Keep assertions focused on the user-facing behavior you want to enforce.
- Reach for `judge_agent_behavior` (see `tests/integration/utils/llm_judge.py`) when human-style evaluation is needed, as in the sketch below.
- Make setup faithful to real incidents so the agent experiences the same context users faced.

For additional patterns, read the existing suites such as `b01_no_premature_implementation.py`.

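When an LLM judge is appropriate, `verify_result` can delegate the qualitative check to `judge_agent_behavior` and record the judge's usage so it is counted in the test cost. The call below is a sketch only: the keyword arguments and the fields on the returned verdict are assumptions for illustration; check `tests/integration/utils/llm_judge.py` for the actual interface.

```python
from tests.integration.base import BaseIntegrationTest, TestResult
from tests.integration.utils.llm_judge import judge_agent_behavior


class AdviceOnlyBehaviorTest(BaseIntegrationTest):
    """Sketch: tools and setup omitted for brevity."""

    INSTRUCTION: str = "How should I implement feature X in this codebase?"

    def verify_result(self) -> TestResult:
        # Assumed signature and return shape -- verify against llm_judge.py.
        verdict = judge_agent_behavior(
            conversation=self.conversation,
            rubric=(
                "The agent should explore the codebase, propose approaches, and "
                "ask clarifying questions. It must NOT create or edit files."
            ),
        )
        # Fold the judge's token/cost usage into the test totals (see base.py).
        self.add_judge_usage(
            prompt_tokens=verdict.prompt_tokens,
            completion_tokens=verdict.completion_tokens,
            cost=verdict.cost,
        )
        return TestResult(success=verdict.passed, reason=verdict.explanation)
```
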
## Running Tests

Use the integration runner locally when developing new scenarios:

```bash
python tests/integration/run_infer.py \
  --llm-config '{"model": "claude-sonnet-4-5-20250929"}' \
  --eval-ids "b01_no_premature_implementation"
```

CI automatically runs behavior and integration tests together via `.github/workflows/integration-runner.yml` when the `integration-test` label is applied or the workflow is triggered manually.

## Test Results

Results include both integration and behavior tests with separate success rates:

```
Overall Success rate: 90.00% (9/10)
Integration tests (Required): 100.00% (8/8)
Behavior tests (Optional): 50.00% (1/2)
Evaluation Results:
✓: t01_fix_simple_typo - Successfully fixed all typos
✓: b01_no_premature_implementation - Agent correctly provided advice without implementing
...
```

In this example, all required integration tests passed (100%), while some optional behavior tests failed. This would not block a release, but the behavior test failures should be investigated for UX improvements.

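The split above can be reproduced by grouping results by test type. A minimal sketch, assuming each result record carries a `type` field derived from the filename prefix (this is not the actual reporting code in `run_infer.py`):

```python
def success_rates(results: list[dict]) -> dict[str, float]:
    """Compute per-type success percentages from records like
    {"id": "b01_no_premature_implementation", "type": "behavior", "success": True}.
    """
    rates: dict[str, float] = {}
    for kind in ("integration", "behavior"):
        subset = [r for r in results if r["type"] == kind]
        if subset:
            rates[kind] = 100.0 * sum(r["success"] for r in subset) / len(subset)
    return rates
```
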
## Adding New Behavior Tests

1. **Identify undesirable behavior** from real agent failures
2. **Create a prompt** that might trigger that behavior
3. **Write the test** using the pattern above
4. **Verify locally** before committing
5. **Document** what behavior you're testing and why

## System Message Optimization

Behavior tests serve as **regression tests for system messages**. When evolving system messages:

1. Run the behavior test suite
2. Identify tests that start failing
3. Analyze whether the failure indicates:
   - The system message needs improvement
   - The test needs updating
   - An acceptable trade-off
4. Iterate on the system message
5. Re-run tests to verify
tests/integration/README.md

Lines changed: 40 additions & 3 deletions
@@ -6,18 +6,32 @@ This directory contains integration tests for the agent-sdk that use real LLM ca
 
 The integration tests are designed to verify that the agent-sdk works correctly with real LLM models by running complete workflows. Each test creates a temporary environment, provides the agent with specific tools, gives it an instruction, and then verifies the results.
 
+### Test Types
+
+Tests are classified into two types based on their filename prefix:
+
+- **Integration tests** (`t*.py`) - **REQUIRED**: Verify that the agent successfully completes essential tasks. These tests must pass for releases and focus on task completion and outcomes.
+- **Behavior tests** (`b*.py`) - **OPTIONAL**: Verify that the agent follows system message guidelines and best practices. These tests track quality improvements and focus on how the agent approaches problems. Failures don't block releases but should be addressed for optimal user experience.
+
+Success rates are calculated separately for each test type to track both completion capability and behavior quality.
+
+See [BEHAVIOR_TESTS.md](BEHAVIOR_TESTS.md) for more details on behavior testing.
+
 ## Directory Structure
 
 ```
 tests/integration/
 ├── README.md          # This file
+├── BEHAVIOR_TESTS.md  # Documentation for behavior testing framework
 ├── __init__.py        # Package initialization
 ├── base.py            # Base classes for integration tests
 ├── run_infer.py       # Main test runner script
 ├── run_infer.sh       # Shell script wrapper for running tests
 ├── outputs/           # Test results and reports (auto-generated)
-└── tests/             # Individual test files (e.g., t01_fix_simple_typo_class_based.py)
-    └── t*.py
+├── tests/             # Individual test files
+│   ├── t*.py          # Task completion tests (critical)
+│   └── b*.py          # Agent behavior tests (ux)
+└── utils/             # Test utilities (e.g., llm_judge.py)
 ```
 
 ## Running Integration Tests
@@ -48,4 +62,27 @@ The GitHub workflow runs integration tests in the following scenarios:
 
 1. **Pull Request Labels**: When a PR is labeled with `integration-test`
 2. **Manual Trigger**: Via workflow dispatch with a required reason
-3. **Scheduled Runs**: Daily at 10:30 PM UTC (cron: `30 22 * * *`)
+3. **Scheduled Runs**: Daily at 10:30 PM UTC (cron: `30 22 * * *`)
+
+## Available Tests
+
+### Integration Tests (`t*.py`) - **Required**
+
+These tests must pass for releases and verify that the agent can successfully complete essential tasks:
+
+- **t01_fix_simple_typo** - Tests that the agent can fix typos in a file
+- **t02_add_bash_hello** - Tests that the agent can execute bash commands
+- **t03_jupyter_write_file** - Tests Jupyter notebook integration
+- **t04_git_staging** - Tests git operations
+- **t05_simple_browsing** - Tests web browsing capabilities
+- **t06_github_pr_browsing** - Tests GitHub PR browsing
+- **t07_interactive_commands** - Tests interactive command handling
+- **t08_image_file_viewing** - Tests image file viewing capabilities
+
+### Behavior Tests (`b*.py`) - **Optional**
+
+These tests track quality improvements and don't block releases. They verify that agents follow system message guidelines and handle complex, nuanced scenarios appropriately:
+
+- **b01_no_premature_implementation** - Tests that the agent doesn't start implementing when asked for advice. Uses a real codebase (software-agent-sdk checked out to a historical commit) to test that the agent explores, provides suggestions, and asks clarifying questions instead of immediately creating or editing files.
+
+For more details on behavior testing and guidelines for adding new tests, see [BEHAVIOR_TESTS.md](BEHAVIOR_TESTS.md).

tests/integration/base.py

Lines changed: 39 additions & 0 deletions
@@ -200,6 +200,45 @@ def verify_result(self) -> TestResult:
         """
         pass
 
+    def add_judge_usage(
+        self, prompt_tokens: int, completion_tokens: int, cost: float
+    ) -> None:
+        """
+        Add LLM judge usage to conversation stats.
+
+        This ensures judge costs are included in the total test cost.
+
+        Args:
+            prompt_tokens: Number of prompt tokens used by judge
+            completion_tokens: Number of completion tokens used by judge
+            cost: Cost of the judge call
+        """
+        from openhands.sdk.llm.utils.metrics import TokenUsage
+
+        # Add to conversation stats for the test LLM
+        stats = self.conversation.conversation_stats
+        if stats:
+            try:
+                metrics = stats.get_metrics_for_usage("test-llm")
+                # Update accumulated metrics
+                if metrics.accumulated_token_usage:
+                    metrics.accumulated_token_usage.prompt_tokens = (
+                        metrics.accumulated_token_usage.prompt_tokens or 0
+                    ) + prompt_tokens
+                    metrics.accumulated_token_usage.completion_tokens = (
+                        metrics.accumulated_token_usage.completion_tokens or 0
+                    ) + completion_tokens
+                else:
+                    # Create new TokenUsage if it doesn't exist
+                    metrics.accumulated_token_usage = TokenUsage(
+                        prompt_tokens=prompt_tokens,
+                        completion_tokens=completion_tokens,
+                    )
+                metrics.accumulated_cost += cost
+            except Exception:
+                # If test-llm doesn't exist in stats yet, skip
+                pass
+
     def teardown(self):
         """
         Clean up test resources.
