@minpeter
@minpeter minpeter commented Jan 7, 2026

Summary

Add benchmark infrastructure to evaluate Sisyphus agent on Harbor Terminal-Bench.

  • SisyphusAgent: Custom Harbor InstalledAgent that runs OpenCode CLI with --agent Sisyphus in headless mode
  • install-sisyphus.sh.j2: Jinja2 template for container setup (Node.js, OpenCode, oh-my-opencode)
  • ATIF trajectory parsing: Converts OpenCode JSON output to Harbor's trajectory format for metrics collection

Test Results

Verified with hello-world@1.0 dataset:

Agent: sisyphus (claude-sonnet-4-5-20250929)
Dataset: hello-world
Trials: 1 | Errors: 0 | Mean: 1.000 (100% pass)

Usage

# Quick validation
PYTHONPATH=/path/to/oh-my-opencode harbor run -d 'hello-world@1.0' \
  --agent-import-path benchmark.sisyphus_agent:SisyphusAgent \
  -m anthropic/claude-sonnet-4-5-20250929 -n 1

# Full Terminal-Bench 2.0
PYTHONPATH=/path/to/oh-my-opencode harbor run -d 'terminal-bench@2.0' \
  --agent-import-path benchmark.sisyphus_agent:SisyphusAgent \
  -m anthropic/claude-sonnet-4-5-20250929 -n 4

Files Changed

| File | Description |
| --- | --- |
| benchmark/sisyphus_agent.py | Harbor InstalledAgent implementation |
| benchmark/install-sisyphus.sh.j2 | Container setup template |
| benchmark/__init__.py | Python package init |
| .gitignore | Added __pycache__, jobs |

Summary by cubic

Adds Harbor Terminal-Bench integration for the Sisyphus agent, running OpenCode in headless mode with provider-aware setup. This enables consistent benchmarking across Harbor datasets.

  • New Features

    • SisyphusAgent: Runs "opencode --model <provider/model> run --agent Sisyphus --format=json" with provider-specific env; logs to /logs/agent/sisyphus.txt.
    • install-sisyphus.sh.j2: Installs bun, OpenCode, and oh-my-opencode; supports pinned versions; disables noisy hooks; defaults to opencode/zen (free) to reduce benchmark costs.
    • .gitignore: Adds __pycache__ and jobs.
  • Migration

    • Use model names as provider/model (e.g., anthropic/claude-sonnet-4-5) and export required API keys for that provider.
    • Run Harbor with --agent-import-path benchmark.sisyphus_agent:SisyphusAgent against desired datasets.

Written for commit 8eee7ad. Summary will update on new commits.

Add benchmark infrastructure to evaluate Sisyphus on Harbor Terminal-Bench:
- SisyphusAgent: Custom Harbor InstalledAgent that runs OpenCode with Sisyphus
- install-sisyphus.sh.j2: Jinja2 template for container setup
- ATIF trajectory parsing for metrics collection

Tested with hello-world@1.0 dataset (100% pass rate)
@github-actions
Contributor

github-actions bot commented Jan 7, 2026

All contributors have signed the CLA. Thank you! ✅
Posted by the CLA Assistant Lite bot.

@minpeter
Author

minpeter commented Jan 7, 2026

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Jan 7, 2026
@greptile-apps

greptile-apps bot commented Jan 7, 2026

Greptile Summary

This PR adds Harbor Terminal-Bench integration for the Sisyphus agent, enabling automated benchmarking of the agent's code generation capabilities against standard datasets. The implementation creates a custom Harbor InstalledAgent that wraps the OpenCode CLI, automatically manages model provider API keys, and converts OpenCode's JSON event output into Harbor's ATIF (Agent Trajectory Interchange Format) for metrics collection.

Key changes:

  • benchmark/sisyphus_agent.py: Harbor-compatible agent wrapper that orchestrates the OpenCode CLI with Sisyphus in headless mode and parses ATIF trajectories
  • benchmark/install-sisyphus.sh.j2: Container setup template installing NVM, Node.js, OpenCode, and oh-my-opencode with configurable versions
  • .gitignore: Added Python cache and Harbor job artifacts directories

The implementation integrates cleanly with the oh-my-opencode architecture, properly delegates to the existing Sisyphus agent (avoiding code duplication), and includes comprehensive provider-specific API key management for Anthropic, OpenAI, Google, and other cloud providers.

Confidence Score: 5/5

  • This PR is safe to merge with no identified issues or risks.
  • The code demonstrates strong quality across multiple dimensions: (1) Clean integration - properly extends Harbor's BaseInstalledAgent without modifying core oh-my-opencode logic, (2) Security - implements provider-based API key management with explicit key whitelisting rather than broad environment variable copying, (3) Error handling - gracefully handles missing output files and malformed JSON without crashing, (4) Architecture adherence - follows oh-my-opencode patterns by delegating to existing Sisyphus agent rather than reimplementing, (5) No dependencies introduced - only imports from Harbor SDK and standard library. The Jinja2 template is straightforward and appropriate for the use case. Testing is verified in PR description with successful hello-world run.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmark/sisyphus_agent.py | Harbor InstalledAgent implementation that runs OpenCode CLI with Sisyphus agent in headless mode. Parses OpenCode JSON output into ATIF trajectory format for metrics collection. Implementation is clean, follows Harbor patterns correctly, includes proper error handling for output file parsing, and securely manages environment variables by provider. |
| benchmark/install-sisyphus.sh.j2 | Jinja2 shell script template for container setup. Installs NVM v0.40.2, Node.js 22, OpenCode, and oh-my-opencode with optional versioning. Script is straightforward with Jinja2 conditionals for version control. Note: Direct curl pipe to bash is a known pattern in the NVM installer ecosystem. |
| benchmark/__init__.py | Empty Python package marker file. Proper structure for making benchmark/ a Python package. |
| .gitignore | Added __pycache__ and jobs entries to existing gitignore. Appropriate for Python benchmarks and Harbor's job artifact cleanup. |

Sequence Diagram

sequenceDiagram
    actor Harbor as Harbor Framework
    participant SisyphusAgent as SisyphusAgent<br/>(benchmark/)
    participant OpenCode as OpenCode CLI
    participant Sisyphus as Sisyphus Agent<br/>(oh-my-opencode)
    participant ATIF as ATIF Trajectory<br/>JSON Output

    Harbor->>SisyphusAgent: run benchmark task
    SisyphusAgent->>SisyphusAgent: Resolve provider from model
    SisyphusAgent->>SisyphusAgent: Load provider API keys from env
    SisyphusAgent->>OpenCode: exec: opencode --model X run --agent Sisyphus --format=json
    OpenCode->>Sisyphus: initialize Sisyphus agent
    OpenCode->>Sisyphus: process instruction
    Sisyphus->>Sisyphus: generate code/solutions
    Sisyphus-->>OpenCode: emit JSON events (user/assistant)
    OpenCode-->>SisyphusAgent: stdout stream (JSON lines)
    SisyphusAgent->>SisyphusAgent: parse JSON events line-by-line
    SisyphusAgent->>SisyphusAgent: map to ATIF Steps
    SisyphusAgent->>ATIF: create Trajectory with metrics
    ATIF-->>Harbor: save trajectory.json + metrics
    SisyphusAgent-->>Harbor: populate context (tokens, cost)

@greptile-apps

greptile-apps bot commented Jan 7, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0badc86c16

Comment on lines +62 to +65
command=(
f"opencode --model {self.model_name} run "
f"--agent Sisyphus --format=json {escaped_instruction} "
f"2>&1 | tee /logs/agent/sisyphus.txt"

P2 Badge Preserve opencode failures in piped command

The command pipes opencode output through tee, which makes the shell exit status come from tee rather than opencode unless pipefail (or PIPESTATUS) is used. In runs where opencode fails (e.g., bad API key, CLI crash), the pipeline can still exit 0 because tee succeeds, so Harbor may treat the trial as successful and skip failure handling while producing a partial/empty trajectory. Consider enabling set -o pipefail or explicitly propagating opencode’s exit code before tee.
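The effect described above can be reproduced without opencode. The following is a minimal sketch, assuming bash is available on the machine; `false` stands in for a failing opencode run, and the helper name `pipeline_exit_code` is made up for this illustration.

```python
import subprocess


def pipeline_exit_code(script: str) -> int:
    """Run a script under bash and return the shell's exit status."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# `false` always exits 1 and plays the role of a failing opencode invocation.
without_pipefail = pipeline_exit_code("false | tee /dev/null")
with_pipefail = pipeline_exit_code("set -o pipefail; false | tee /dev/null")
with_pipestatus = pipeline_exit_code("false | tee /dev/null; exit ${PIPESTATUS[0]}")

print(without_pipefail, with_pipefail, with_pipestatus)  # prints: 0 1 1
```

Without either measure the pipeline reports 0, because only the status of `tee` is returned. Both `pipefail` and `PIPESTATUS` recover the failure.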

@code-yeongyu
Owner

@sisyphus-dev-ai review

@sisyphus-dev-ai sisyphus-dev-ai added the sisyphus: working Sisyphus is currently working on this label Jan 7, 2026
@minpeter minpeter marked this pull request as draft January 7, 2026 12:58
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 4 files

Confidence score: 3/5

  • Pipeline in benchmark/sisyphus_agent.py swallows the opencode exit status because tee is last in the pipe, so real CLI failures (bad API key, crash) would go unnoticed.
  • Given this risk of silently passing failed executions, there’s a tangible chance of misleading benchmark outcomes, so I’d treat the merge with caution.
  • Pay close attention to benchmark/sisyphus_agent.py - ensure the pipeline preserves the opencode exit status.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="benchmark/sisyphus_agent.py">

<violation number="1" location="benchmark/sisyphus_agent.py:65">
P2: The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.</violation>
</file>

command=(
f"opencode --model {self.model_name} run "
f"--agent Sisyphus --format=json {escaped_instruction} "
f"2>&1 | tee /logs/agent/sisyphus.txt"
@cubic-dev-ai cubic-dev-ai bot Jan 7, 2026

P2: The piped command loses the exit status of opencode because the shell returns the exit status of the last command in the pipeline (tee). If opencode fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since tee succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with bash -o pipefail -c '...' or using ${PIPESTATUS[0]} to preserve the exit code.


oh-my-opencode CLI requires bun runtime which was missing in Docker container,
causing setup to fail with 'bun: No such file or directory' (exit code 127).
@sisyphus-dev-ai
Collaborator

Code Review by Sisyphus

Hi @code-yeongyu! I've completed a thorough review of PR #574. This is an excellent implementation of Harbor Terminal-Bench integration. The code is clean, well-structured, and follows solid engineering principles.

✅ Strengths

Architecture & Design:

  • Clean integration: Properly extends Harbor's BaseInstalledAgent without modifying core oh-my-opencode logic
  • Proper delegation: Correctly delegates to existing Sisyphus agent rather than reimplementing functionality
  • Separation of concerns: Installation script, agent wrapper, and ATIF conversion are well-separated

Security & Robustness:

  • Secure API key management: Implements provider-based whitelisting (lines 71-110) rather than blindly copying all environment variables
  • Comprehensive provider support: Supports 12 model providers (Anthropic, OpenAI, Google, AWS, Azure, DeepSeek, GitHub, Groq, Hugging Face, Mistral, XAI, Llama)
  • Graceful error handling: Handles missing output files (lines 114-115) and malformed JSON (lines 138-142) without crashing
  • Input sanitization: Uses shlex.quote() for shell escaping (line 50)

ATIF Conversion:

  • Complete trajectory mapping: Properly converts OpenCode JSON events to ATIF format with all required fields
  • Metrics tracking: Correctly aggregates token usage and costs across all events (lines 148-243)
  • Tool call & observation handling: Properly maps tool calls and results to ATIF format (lines 172-192)
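For readers without the diff at hand, the tolerant parsing and metric aggregation praised above could look roughly like the sketch below. It is not the PR's code. The function names are invented, and the event shape (a `tokens` object with `input` and `output` counts) is a hypothetical stand-in for whatever OpenCode actually emits.

```python
import json
from pathlib import Path


def read_events(path: Path) -> list[dict]:
    """Return the JSON objects found in a JSON-lines log, skipping bad lines."""
    if not path.exists():
        return []  # missing output file: nothing to parse, but no crash
    events = []
    for line in path.read_text(encoding="utf-8", errors="replace").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. stderr noise mixed into the log by `2>&1`
        if isinstance(event, dict):
            events.append(event)
    return events


def total_tokens(events: list[dict]) -> dict[str, int]:
    """Sum the hypothetical tokens.input / tokens.output fields across events."""
    totals = {"input": 0, "output": 0}
    for event in events:
        tokens = event.get("tokens")
        if not isinstance(tokens, dict):
            continue  # event carries no usage information
        for key in totals:
            totals[key] += int(tokens.get(key) or 0)
    return totals
```

Skipping bad lines matters here because the run command merges stderr into the same stream, so non-JSON lines are expected rather than exceptional.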

.gitignore Changes:

  • Appropriate additions for Python cache (__pycache__) and Harbor job artifacts (jobs)

⚠️ Issues Requiring Attention

🔴 P2 - Exit Code Lost in Pipeline (benchmark/sisyphus_agent.py:62-65)

Issue: The command pipes opencode output through tee, which causes the shell exit status to come from tee rather than opencode:

command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
),

Impact:

  • If opencode fails (bad API key, CLI crash, etc.) but tee succeeds, the pipeline exits with 0 (success)
  • Harbor may treat the trial as successful and skip failure handling
  • This produces partial/empty trajectories without proper error detection

Recommended Fix:

command=(
    f"set -o pipefail; "
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
),

or alternatively:

command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"> >(tee /logs/agent/sisyphus.txt) 2>&1"
),
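Both variants rely on bash features (`pipefail` and process substitution), so they only help if the command really runs under bash. Whether Harbor executes commands through bash is an assumption not confirmed in this thread. A minimal sketch that makes the shell explicit, using a hypothetical `build_command` helper and also quoting the model name, might be:

```python
import shlex

LOG_PATH = "/logs/agent/sisyphus.txt"  # log path taken from the PR's command


def build_command(model_name: str, instruction: str) -> str:
    """Compose the opencode pipeline so an opencode failure is not masked by tee."""
    pipeline = (
        f"opencode --model {shlex.quote(model_name)} run "
        f"--agent Sisyphus --format=json {shlex.quote(instruction)} "
        f"2>&1 | tee {shlex.quote(LOG_PATH)}"
    )
    # Invoke bash explicitly, since a plain sh may not support pipefail.
    return f"bash -o pipefail -c {shlex.quote(pipeline)}"


print(build_command("anthropic/claude-sonnet-4-5", "Fix the user's failing test"))
```

Quoting the whole pipeline a second time keeps the instruction intact even when it contains quotes of its own.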

Credit: This issue was correctly identified by the Codex reviewer.


💡 Optional Enhancements

1. Model name validation (benchmark/sisyphus_agent.py:52-55)

Current code validates format but doesn't check if provider exists:

if not self.model_name or "/" not in self.model_name:
    raise ValueError("Model name must be in the format provider/model_name")

provider, _ = self.model_name.split("/", 1)

Consider adding:

provider, _ = self.model_name.split("/", 1)
if provider not in self.provider_keys:
    raise ValueError(f"Unknown provider {provider}. Supported: {list(self.provider_keys.keys())}")

This would provide clearer error messages for typos like anthropix/claude-opus-4-5.

2. Provider key extraction from _get_provider_env (benchmark/sisyphus_agent.py:71-110)

The provider-to-keys mapping (lines 75-100) could be extracted as a class constant for easier maintenance and testing:

class SisyphusAgent(BaseInstalledAgent):
    PROVIDER_KEYS = {
        "amazon-bedrock": [...],
        "anthropic": [...],
        # ... etc
    }
    
    def _get_provider_env(self, provider: str) -> dict[str, str]:
        keys = self.PROVIDER_KEYS.get(provider, [])
        if not keys:
            raise ValueError(f"Unknown provider {provider}")
        # ...
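Suggestions 1 and 2 can be combined into one self-contained sketch. The mapping below is illustrative only: the PR's real table covers more providers and may use different variable names, and the `opencode` entry with an empty list is an assumption based on the later commit that adds the keyless provider. The sketch tests membership rather than truthiness, since a check like `if not keys` would reject a provider that legitimately needs no key.

```python
import os
from typing import Mapping, Optional

# Illustrative subset; not the PR's actual provider table.
PROVIDER_KEYS: dict[str, list[str]] = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "opencode": [],  # assumed name of the free provider; it needs no API key
}


def provider_env(model_name: str, environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return only the whitelisted variables for the provider in provider/model."""
    environ = os.environ if environ is None else environ
    provider, sep, model = model_name.partition("/")
    if not sep or not provider or not model:
        raise ValueError("Model name must be in the format provider/model_name")
    if provider not in PROVIDER_KEYS:  # membership test, so an empty key list is valid
        raise ValueError(
            f"Unknown provider {provider!r}. Supported: {sorted(PROVIDER_KEYS)}"
        )
    return {key: environ[key] for key in PROVIDER_KEYS[provider] if key in environ}
```

Passing `environ` explicitly keeps the function testable without touching the real process environment.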

3. Installation script hardcoded options (benchmark/install-sisyphus.sh.j2:21, 23)

The script hardcodes --claude=yes --chatgpt=no --gemini=no:

npx oh-my-opencode@{{ omo_version }} install --no-tui --claude=yes --chatgpt=no --gemini=no

Consider making this configurable via template variables for different benchmark scenarios.


📝 Non-Issues

Package Manager Usage (npm/npx vs bun/bunx):
While the oh-my-opencode project convention prefers Bun (per AGENTS.md), the use of npm/npx in this benchmark integration is appropriate. This code runs in Harbor's container environment which standardizes on npm/Node.js. Using Bun here would require additional setup and dependencies.


🎯 Summary

Overall Assessment: ✅ APPROVE with one fix required

This is high-quality integration code that demonstrates:

  • Strong understanding of both Harbor and oh-my-opencode architectures
  • Good security practices
  • Comprehensive error handling
  • Clean separation of concerns

Required before merge: Fix the pipefail issue to ensure proper error propagation

Optional improvements: Consider the validation and refactoring suggestions above for long-term maintainability

Testing: PR description confirms successful hello-world run ✅


Great work on this integration! This will enable valuable benchmarking data for Sisyphus's performance.

Files reviewed:

  • ✅ benchmark/sisyphus_agent.py (248 lines)
  • ✅ benchmark/install-sisyphus.sh.j2 (27 lines)
  • ✅ benchmark/__init__.py (empty package marker)
  • ✅ .gitignore (2 additions)

@sisyphus-dev-ai sisyphus-dev-ai removed the sisyphus: working Sisyphus is currently working on this label Jan 7, 2026
Remove nvm/node/npm - bun handles everything:
- bun install -g opencode-ai
- bunx oh-my-opencode install

Remove ATIF trajectory parsing - keep only essential:
- name(), install template path, run command
- 263 lines → 74 lines

Switch from --claude=yes to --claude=no for oh-my-opencode install.
This configures all subagents to use the free opencode/zen provider
instead of paid Anthropic API.

The opencode/zen provider doesn't require an API key, so add it to
provider_keys with an empty list.

Pre-create oh-my-opencode config before plugin install to prevent
hook initialization during load, which can cause rate limits or
unnecessary overhead in benchmark environments.
3 participants