feat: add Harbor Terminal-Bench integration for Sisyphus agent #574

minpeter · 2026-01-07T12:52:14Z

Summary

Add benchmark infrastructure to evaluate Sisyphus agent on Harbor Terminal-Bench.

SisyphusAgent: Custom Harbor InstalledAgent that runs OpenCode CLI with --agent Sisyphus in headless mode
install-sisyphus.sh.j2: Jinja2 template for container setup (Node.js, OpenCode, oh-my-opencode)
ATIF trajectory parsing: Converts OpenCode JSON output to Harbor's trajectory format for metrics collection

Test Results

Verified with hello-world@1.0 dataset:

Agent: sisyphus (claude-sonnet-4-5-20250929)
Dataset: hello-world
Trials: 1 | Errors: 0 | Mean: 1.000 (100% pass)

Usage

# Quick validation
PYTHONPATH=/path/to/oh-my-opencode harbor run -d 'hello-world@1.0' \
  --agent-import-path benchmark.sisyphus_agent:SisyphusAgent \
  -m anthropic/claude-sonnet-4-5-20250929 -n 1

# Full Terminal-Bench 2.0
PYTHONPATH=/path/to/oh-my-opencode harbor run -d 'terminal-bench@2.0' \
  --agent-import-path benchmark.sisyphus_agent:SisyphusAgent \
  -m anthropic/claude-sonnet-4-5-20250929 -n 4

Files Changed

File	Description
`benchmark/sisyphus_agent.py`	Harbor InstalledAgent implementation
`benchmark/install-sisyphus.sh.j2`	Container setup template
`benchmark/__init__.py`	Python package init
`.gitignore`	Added `__pycache__`, `jobs`

Summary by cubic

Adds Harbor Terminal-Bench integration for the Sisyphus agent, running OpenCode in headless mode with provider-aware setup. This enables consistent benchmarking across Harbor datasets.

New Features
- SisyphusAgent: Runs "opencode --model <provider/model> run --agent Sisyphus --format=json" with provider-specific env; logs to /logs/agent/sisyphus.txt.
- install-sisyphus.sh.j2: Installs bun, OpenCode, and oh-my-opencode; supports pinned versions; disables noisy hooks; defaults to opencode/zen (free) to reduce benchmark costs.
- .gitignore: Adds pycache and jobs.
Migration
- Use model names as provider/model (e.g., anthropic/claude-sonnet-4-5) and export required API keys for that provider.
- Run Harbor with --agent-import-path benchmark.sisyphus_agent:SisyphusAgent against desired datasets.

^{Written for commit 8eee7ad. Summary will update on new commits.}

Add benchmark infrastructure to evaluate Sisyphus on Harbor Terminal-Bench: - SisyphusAgent: Custom Harbor InstalledAgent that runs OpenCode with Sisyphus - install-sisyphus.sh.j2: Jinja2 template for container setup - ATIF trajectory parsing for metrics collection Tested with hello-world@1.0 dataset (100% pass rate)

github-actions · 2026-01-07T12:52:25Z

All contributors have signed the CLA. Thank you! ✅
_{Posted by the CLA Assistant Lite bot.}

minpeter · 2026-01-07T12:53:05Z

I have read the CLA Document and I hereby sign the CLA

greptile-apps · 2026-01-07T12:54:46Z

Greptile Summary

This PR adds Harbor Terminal-Bench integration for the Sisyphus agent, enabling automated benchmarking of the agent's code generation capabilities against standard datasets. The implementation creates a custom Harbor InstalledAgent that wraps the OpenCode CLI, automatically manages model provider API keys, and converts OpenCode's JSON event output into Harbor's ATIF (Agent Trajectory Interchange Format) for metrics collection.

Key changes:

benchmark/sisyphus_agent.py: Harbor-compatible agent wrapper that orchestrates the OpenCode CLI with Sisyphus in headless mode and parses ATIF trajectories
benchmark/install-sisyphus.sh.j2: Container setup template installing NVM, Node.js, OpenCode, and oh-my-opencode with configurable versions
.gitignore: Added Python cache and Harbor job artifacts directories

The implementation integrates cleanly with the oh-my-opencode architecture, properly delegates to the existing Sisyphus agent (avoiding code duplication), and includes comprehensive provider-specific API key management for Anthropic, OpenAI, Google, and other cloud providers.

Confidence Score: 5/5

This PR is safe to merge with no identified issues or risks.
The code demonstrates strong quality across multiple dimensions: (1) Clean integration - properly extends Harbor's BaseInstalledAgent without modifying core oh-my-opencode logic, (2) Security - implements provider-based API key management with explicit key whitelisting rather than broad environment variable copying, (3) Error handling - gracefully handles missing output files and malformed JSON without crashing, (4) Architecture adherence - follows oh-my-opencode patterns by delegating to existing Sisyphus agent rather than reimplementing, (5) No dependencies introduced - only imports from Harbor SDK and standard library. The Jinja2 template is straightforward and appropriate for the use case. Testing is verified in PR description with successful hello-world run.
No files require special attention.

Important Files Changed

Filename	Overview
benchmark/sisyphus_agent.py	Harbor InstalledAgent implementation that runs OpenCode CLI with Sisyphus agent in headless mode. Parses OpenCode JSON output into ATIF trajectory format for metrics collection. Implementation is clean, follows Harbor patterns correctly, includes proper error handling for output file parsing, and securely manages environment variables by provider.
benchmark/install-sisyphus.sh.j2	Jinja2 shell script template for container setup. Installs NVM v0.40.2, Node.js 22, OpenCode, and oh-my-opencode with optional versioning. Script is straightforward with Jinja2 conditionals for version control. Note: Direct curl pipe to bash is a known pattern in the NVM installer ecosystem.
benchmark/init.py	Empty Python package marker file. Proper structure for making benchmark/ a Python package.
.gitignore	Added `__pycache__` and `jobs` entries to existing gitignore. Appropriate for Python benchmarks and Harbor's job artifact cleanup.

Sequence Diagram

sequenceDiagram
    actor Harbor as Harbor Framework
    participant SisyphusAgent as SisyphusAgent<br/>(benchmark/)
    participant OpenCode as OpenCode CLI
    participant Sisyphus as Sisyphus Agent<br/>(oh-my-opencode)
    participant ATIF as ATIF Trajectory<br/>JSON Output

    Harbor->>SisyphusAgent: run benchmark task
    SisyphusAgent->>SisyphusAgent: Resolve provider from model
    SisyphusAgent->>SisyphusAgent: Load provider API keys from env
    SisyphusAgent->>OpenCode: exec: opencode --model X run --agent Sisyphus --format=json
    OpenCode->>Sisyphus: initialize Sisyphus agent
    OpenCode->>Sisyphus: process instruction
    Sisyphus->>Sisyphus: generate code/solutions
    Sisyphus-->>OpenCode: emit JSON events (user/assistant)
    OpenCode-->>SisyphusAgent: stdout stream (JSON lines)
    SisyphusAgent->>SisyphusAgent: parse JSON events line-by-line
    SisyphusAgent->>SisyphusAgent: map to ATIF Steps
    SisyphusAgent->>ATIF: create Trajectory with metrics
    ATIF-->>Harbor: save trajectory.json + metrics
    SisyphusAgent-->>Harbor: populate context (tokens, cost)

greptile-apps · 2026-01-07T12:54:47Z

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

_{This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0badc86c16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-07T12:54:56Z

benchmark/sisyphus_agent.py

+                command=(
+                    f"opencode --model {self.model_name} run "
+                    f"--agent Sisyphus --format=json {escaped_instruction} "
+                    f"2>&1 | tee /logs/agent/sisyphus.txt"


Preserve opencode failures in piped command

The command pipes opencode output through tee, which makes the shell exit status come from tee rather than opencode unless pipefail (or PIPESTATUS) is used. In runs where opencode fails (e.g., bad API key, CLI crash), the pipeline can still exit 0 because tee succeeds, so Harbor may treat the trial as successful and skip failure handling while producing a partial/empty trajectory. Consider enabling set -o pipefail or explicitly propagating opencode’s exit code before tee.

Useful? React with 👍 / 👎.

code-yeongyu · 2026-01-07T12:56:51Z

@sisyphus-dev-ai review

cubic-dev-ai

1 issue found across 4 files

Confidence score: 3/5

Pipeline in benchmark/sisyphus_agent.py swallows the opencode exit status because tee is last in the pipe, so real CLI failures (bad API key, crash) would go unnoticed.
Given this risk of silently passing failed executions, there’s a tangible chance of misleading benchmark outcomes, so I’d treat the merge with caution.
Pay close attention to benchmark/sisyphus_agent.py - ensure the pipeline preserves the opencode exit status.

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="benchmark/sisyphus_agent.py">

<violation number="1" location="benchmark/sisyphus_agent.py:65">
P2: The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.</violation>
</file>

Since this is your first cubic review, here's how it works:

cubic automatically reviews your code and comments on bugs and improvements
Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
Ask questions if you need clarification on any suggestion

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-01-07T12:58:09Z

benchmark/sisyphus_agent.py

+                command=(
+                    f"opencode --model {self.model_name} run "
+                    f"--agent Sisyphus --format=json {escaped_instruction} "
+                    f"2>&1 | tee /logs/agent/sisyphus.txt"


P2: The piped command loses the exit status of opencode because the shell returns the exit status of the last command in the pipeline (tee). If opencode fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since tee succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with bash -o pipefail -c '...' or using ${PIPESTATUS[0]} to preserve the exit code.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/sisyphus_agent.py, line 65: <comment>The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.</comment> <file context> @@ -0,0 +1,248 @@ + command=( + f"opencode --model {self.model_name} run " + f"--agent Sisyphus --format=json {escaped_instruction} " + f"2>&1 | tee /logs/agent/sisyphus.txt" + ), + env=env, </file context>

oh-my-opencode CLI requires bun runtime which was missing in Docker container, causing setup to fail with 'bun: No such file or directory' (exit code 127).

sisyphus-dev-ai · 2026-01-07T13:00:04Z

Code Review by Sisyphus

Hi @code-yeongyu! I've completed a thorough review of PR #574. This is an excellent implementation of Harbor Terminal-Bench integration. The code is clean, well-structured, and follows solid engineering principles.

✅ Strengths

Architecture & Design:

Clean integration: Properly extends Harbor's BaseInstalledAgent without modifying core oh-my-opencode logic
Proper delegation: Correctly delegates to existing Sisyphus agent rather than reimplementing functionality
Separation of concerns: Installation script, agent wrapper, and ATIF conversion are well-separated

Security & Robustness:

Secure API key management: Implements provider-based whitelisting (lines 71-110) rather than blindly copying all environment variables
Comprehensive provider support: Supports 10 different model providers (Anthropic, OpenAI, Google, AWS, Azure, DeepSeek, GitHub, Groq, Hugging Face, Mistral, XAI, Llama)
Graceful error handling: Handles missing output files (line 114-115) and malformed JSON (lines 138-142) without crashing
Input sanitization: Uses shlex.quote() for shell escaping (line 50)

ATIF Conversion:

Complete trajectory mapping: Properly converts OpenCode JSON events to ATIF format with all required fields
Metrics tracking: Correctly aggregates token usage and costs across all events (lines 148-243)
Tool call & observation handling: Properly maps tool calls and results to ATIF format (lines 172-192)

.gitignore Changes:

Appropriate additions for Python cache (__pycache__) and Harbor job artifacts (jobs)

⚠️ Issues Requiring Attention

🔴 P2 - Exit Code Lost in Pipeline (benchmark/sisyphus_agent.py:62-65)

Issue: The command pipes opencode output through tee, which causes the shell exit status to come from tee rather than opencode:

command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
),

Impact:

If opencode fails (bad API key, CLI crash, etc.) but tee succeeds, the pipeline exits with 0 (success)
Harbor may treat the trial as successful and skip failure handling
This produces partial/empty trajectories without proper error detection

Recommended Fix:

command=(
    f"set -o pipefail; "
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
),

or alternatively:

command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"> >(tee /logs/agent/sisyphus.txt) 2>&1"
),

Credit: This issue was correctly identified by the Codex reviewer.

💡 Optional Enhancements

1. Model name validation (benchmark/sisyphus_agent.py:52-55)

Current code validates format but doesn't check if provider exists:

if not self.model_name or "/" not in self.model_name:
    raise ValueError("Model name must be in the format provider/model_name")

provider, _ = self.model_name.split("/", 1)

Consider adding:

provider, _ = self.model_name.split("/", 1)
if provider not in self.provider_keys:
    raise ValueError(f"Unknown provider {provider}. Supported: {list(self.provider_keys.keys())}")

This would provide clearer error messages for typos like anthropix/claude-opus-4-5.

2. Provider key extraction from `_get_provider_env` (benchmark/sisyphus_agent.py:71-110)

The provider-to-keys mapping (lines 75-100) could be extracted as a class constant for easier maintenance and testing:

class SisyphusAgent(BaseInstalledAgent):
    PROVIDER_KEYS = {
        "amazon-bedrock": [...],
        "anthropic": [...],
        # ... etc
    }
    
    def _get_provider_env(self, provider: str) -> dict[str, str]:
        keys = self.PROVIDER_KEYS.get(provider, [])
        if not keys:
            raise ValueError(f"Unknown provider {provider}")
        # ...

3. Installation script hardcoded options (benchmark/install-sisyphus.sh.j2:21, 23)

The script hardcodes --claude=yes --chatgpt=no --gemini=no:

npx oh-my-opencode@{{ omo_version }} install --no-tui --claude=yes --chatgpt=no --gemini=no

Consider making this configurable via template variables for different benchmark scenarios.

📝 Non-Issues

Package Manager Usage (npm/npx vs bun/bunx):
While the oh-my-opencode project convention prefers Bun (per AGENTS.md), the use of npm/npx in this benchmark integration is appropriate. This code runs in Harbor's container environment which standardizes on npm/Node.js. Using Bun here would require additional setup and dependencies.

🎯 Summary

Overall Assessment: ✅ APPROVE with one fix required

This is high-quality integration code that demonstrates:

Strong understanding of both Harbor and oh-my-opencode architectures
Good security practices
Comprehensive error handling
Clean separation of concerns

Required before merge: Fix the pipefail issue to ensure proper error propagation

Optional improvements: Consider the validation and refactoring suggestions above for long-term maintainability

Testing: PR description confirms successful hello-world run ✅

Great work on this integration! This will enable valuable benchmarking data for Sisyphus's performance.

Files reviewed:

✅ benchmark/sisyphus_agent.py (248 lines)
✅ benchmark/install-sisyphus.sh.j2 (27 lines)
✅ benchmark/init.py (empty package marker)
✅ .gitignore (2 additions)

Remove nvm/node/npm - bun handles everything: - bun install -g opencode-ai - bunx oh-my-opencode install

Remove ATIF trajectory parsing - keep only essential: - name(), install template path, run command - 263 lines → 74 lines

Switch from --claude=yes to --claude=no for oh-my-opencode install. This configures all subagents to use the free opencode/zen provider instead of paid Anthropic API.

The opencode/zen provider doesn't require an API key, so add it to provider_keys with an empty list.

``` Pre-create oh-my-opencode config before plugin install to prevent hook initialization during load, which can cause rate limits or unnecessary overhead in benchmark environments. ```

github-actions bot added a commit that referenced this pull request Jan 7, 2026

@minpeter has signed the CLA in #574

a54d9b1

chatgpt-codex-connector bot reviewed Jan 7, 2026

View reviewed changes

sisyphus-dev-ai added the sisyphus: working Sisyphus is currently working on this label Jan 7, 2026

minpeter marked this pull request as draft January 7, 2026 12:58

cubic-dev-ai bot reviewed Jan 7, 2026

View reviewed changes

fix: add bun installation to container setup

e2bfa71

oh-my-opencode CLI requires bun runtime which was missing in Docker container, causing setup to fail with 'bun: No such file or directory' (exit code 127).

sisyphus-dev-ai removed the sisyphus: working Sisyphus is currently working on this label Jan 7, 2026

minpeter added 5 commits January 7, 2026 22:03

refactor: simplify container setup using bun only

08c9e80

Remove nvm/node/npm - bun handles everything: - bun install -g opencode-ai - bunx oh-my-opencode install

refactor: simplify SisyphusAgent to minimal implementation

d5e4f13

Remove ATIF trajectory parsing - keep only essential: - name(), install template path, run command - 263 lines → 74 lines

feat: use opencode/zen provider to reduce benchmark costs

1d0dabb

Switch from --claude=yes to --claude=no for oh-my-opencode install. This configures all subagents to use the free opencode/zen provider instead of paid Anthropic API.

fix: add opencode provider support (no API key required)

042a9b1

The opencode/zen provider doesn't require an API key, so add it to provider_keys with an empty list.

Add pre-install config to disable noisy hooks in benchmarks

8eee7ad

``` Pre-create oh-my-opencode config before plugin install to prevent hook initialization during load, which can cause rate limits or unnecessary overhead in benchmark environments. ```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add Harbor Terminal-Bench integration for Sisyphus agent #574

feat: add Harbor Terminal-Bench integration for Sisyphus agent #574

minpeter commented Jan 7, 2026 •

edited by cubic-dev-ai bot

Loading

Uh oh!

github-actions bot commented Jan 7, 2026 •

edited

Loading

Uh oh!

minpeter commented Jan 7, 2026

Uh oh!

greptile-apps bot commented Jan 7, 2026

Uh oh!

greptile-apps bot commented Jan 7, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Jan 7, 2026

Uh oh!

code-yeongyu commented Jan 7, 2026

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Jan 7, 2026 •

edited

Loading

Uh oh!

sisyphus-dev-ai commented Jan 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: add Harbor Terminal-Bench integration for Sisyphus agent #574

Are you sure you want to change the base?

feat: add Harbor Terminal-Bench integration for Sisyphus agent #574

Conversation

minpeter commented Jan 7, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Results

Usage

Files Changed

Summary by cubic

Uh oh!

github-actions bot commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

minpeter commented Jan 7, 2026

Uh oh!

greptile-apps bot commented Jan 7, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot commented Jan 7, 2026

Greptile found no issues!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

code-yeongyu commented Jan 7, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sisyphus-dev-ai commented Jan 7, 2026

Code Review by Sisyphus

✅ Strengths

⚠️ Issues Requiring Attention

🔴 P2 - Exit Code Lost in Pipeline (benchmark/sisyphus_agent.py:62-65)

💡 Optional Enhancements

1. Model name validation (benchmark/sisyphus_agent.py:52-55)

2. Provider key extraction from _get_provider_env (benchmark/sisyphus_agent.py:71-110)

3. Installation script hardcoded options (benchmark/install-sisyphus.sh.j2:21, 23)

📝 Non-Issues

🎯 Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

minpeter commented Jan 7, 2026 •

edited by cubic-dev-ai bot

Loading

github-actions bot commented Jan 7, 2026 •

edited

Loading

cubic-dev-ai bot Jan 7, 2026 •

edited

Loading

2. Provider key extraction from `_get_provider_env` (benchmark/sisyphus_agent.py:71-110)