feat: add Harbor Terminal-Bench integration for Sisyphus agent #574
base: dev
Add benchmark infrastructure to evaluate Sisyphus on Harbor Terminal-Bench:
- SisyphusAgent: Custom Harbor InstalledAgent that runs OpenCode with Sisyphus
- install-sisyphus.sh.j2: Jinja2 template for container setup
- ATIF trajectory parsing for metrics collection

Tested with hello-world@1.0 dataset (100% pass rate)
Greptile Summary
This PR adds Harbor Terminal-Bench integration for the Sisyphus agent, enabling automated benchmarking of the agent's code generation capabilities against standard datasets. The implementation creates a custom Harbor
Key changes:
The implementation integrates cleanly with the oh-my-opencode architecture, properly delegates to the existing Sisyphus agent (avoiding code duplication), and includes comprehensive provider-specific API key management for Anthropic, OpenAI, Google, and other cloud providers. Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
actor Harbor as Harbor Framework
participant SisyphusAgent as SisyphusAgent<br/>(benchmark/)
participant OpenCode as OpenCode CLI
participant Sisyphus as Sisyphus Agent<br/>(oh-my-opencode)
participant ATIF as ATIF Trajectory<br/>JSON Output
Harbor->>SisyphusAgent: run benchmark task
SisyphusAgent->>SisyphusAgent: Resolve provider from model
SisyphusAgent->>SisyphusAgent: Load provider API keys from env
SisyphusAgent->>OpenCode: exec: opencode --model X run --agent Sisyphus --format=json
OpenCode->>Sisyphus: initialize Sisyphus agent
OpenCode->>Sisyphus: process instruction
Sisyphus->>Sisyphus: generate code/solutions
Sisyphus-->>OpenCode: emit JSON events (user/assistant)
OpenCode-->>SisyphusAgent: stdout stream (JSON lines)
SisyphusAgent->>SisyphusAgent: parse JSON events line-by-line
SisyphusAgent->>SisyphusAgent: map to ATIF Steps
SisyphusAgent->>ATIF: create Trajectory with metrics
ATIF-->>Harbor: save trajectory.json + metrics
SisyphusAgent-->>Harbor: populate context (tokens, cost)
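The "parse JSON events line-by-line" and "map to ATIF Steps" stages in the diagram above might look roughly like the sketch below. It is an illustration only: the event fields (`role`, `text`) and the step dictionary layout are hypothetical placeholders, not the actual OpenCode event format or the ATIF schema. Because the run command merges stderr into stdout with `2>&1`, the parser is written to skip lines that are not JSON objects.

```python
import json
from typing import Any, Iterable


def parse_json_events(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Collect lines that decode to JSON objects; skip everything else."""
    events: list[dict[str, Any]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # Plain-text noise (for example stderr merged in by 2>&1).
            continue
        if isinstance(obj, dict):
            events.append(obj)
    return events


def to_steps(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map events to simplified step dicts (hypothetical layout)."""
    return [
        {
            "step_id": index,
            "source": event.get("role", "unknown"),  # hypothetical field
            "message": event.get("text", ""),        # hypothetical field
        }
        for index, event in enumerate(events, start=1)
    ]
```

Skipping undecodable lines rather than raising keeps a single stray warning from discarding an otherwise usable trajectory, at the cost of hiding malformed events; a stricter variant could count the skipped lines and report them.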
Greptile found no issues!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0badc86c16
command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
Preserve opencode failures in piped command
The command pipes opencode output through tee, which makes the shell exit status come from tee rather than opencode unless pipefail (or PIPESTATUS) is used. In runs where opencode fails (e.g., bad API key, CLI crash), the pipeline can still exit 0 because tee succeeds, so Harbor may treat the trial as successful and skip failure handling while producing a partial/empty trajectory. Consider enabling set -o pipefail or explicitly propagating opencode’s exit code before tee.
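The masking effect described in this comment can be reproduced in isolation. The sketch below assumes `bash` is available on the host and uses `false` as a stand-in for a failing `opencode` invocation; the helper name `pipeline_exit_code` is made up for the illustration and is not part of the PR.

```python
import subprocess


def pipeline_exit_code(pipeline: str, pipefail: bool) -> int:
    """Run a shell pipeline under bash and return its exit status."""
    argv = ["bash"]
    if pipefail:
        argv += ["-o", "pipefail"]
    argv += ["-c", pipeline]
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return completed.returncode


# `false` always fails, `tee` always succeeds here.
masked = pipeline_exit_code("false | tee /dev/null", pipefail=False)
preserved = pipeline_exit_code("false | tee /dev/null", pipefail=True)
print(f"without pipefail: {masked}, with pipefail: {preserved}")
```

Without `pipefail` the pipeline reports the status of `tee`, so the failure of the first command is lost; with `pipefail` the non-zero status of `false` is what the caller sees.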
@sisyphus-dev-ai review
1 issue found across 4 files
Confidence score: 3/5
- Pipeline in `benchmark/sisyphus_agent.py` swallows the `opencode` exit status because `tee` is last in the pipe, so real CLI failures (bad API key, crash) would go unnoticed.
- Given this risk of silently passing failed executions, there's a tangible chance of misleading benchmark outcomes, so I'd treat the merge with caution.
- Pay close attention to `benchmark/sisyphus_agent.py` - ensure the pipeline preserves the `opencode` exit status.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="benchmark/sisyphus_agent.py">
<violation number="1" location="benchmark/sisyphus_agent.py:65">
P2: The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.</violation>
</file>
command=(
    f"opencode --model {self.model_name} run "
    f"--agent Sisyphus --format=json {escaped_instruction} "
    f"2>&1 | tee /logs/agent/sisyphus.txt"
P2: The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/sisyphus_agent.py, line 65:
<comment>The piped command loses the exit status of `opencode` because the shell returns the exit status of the last command in the pipeline (`tee`). If `opencode` fails (e.g., bad API key, CLI crash), the pipeline will still exit 0 since `tee` succeeds, potentially causing failed runs to be treated as successful. Consider wrapping with `bash -o pipefail -c '...'` or using `${PIPESTATUS[0]}` to preserve the exit code.</comment>
<file context>
@@ -0,0 +1,248 @@
+ command=(
+ f"opencode --model {self.model_name} run "
+ f"--agent Sisyphus --format=json {escaped_instruction} "
+ f"2>&1 | tee /logs/agent/sisyphus.txt"
+ ),
+ env=env,
</file context>
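One way to apply the `bash -o pipefail -c '...'` suggestion when building the command string is sketched below. This is not the fix that was merged in the PR; the function name `build_run_command` and its parameters are hypothetical, and only the `opencode` flags and log path are taken from the diff quoted above.

```python
import shlex


def build_run_command(
    model_name: str,
    instruction: str,
    log_path: str = "/logs/agent/sisyphus.txt",
) -> str:
    """Build the opencode pipeline and wrap it so its failures propagate."""
    inner = (
        f"opencode --model {shlex.quote(model_name)} run "
        f"--agent Sisyphus --format=json {shlex.quote(instruction)} "
        f"2>&1 | tee {shlex.quote(log_path)}"
    )
    # Quote the whole pipeline once more so it reaches bash as one argument.
    return f"bash -o pipefail -c {shlex.quote(inner)}"
```

Quoting happens twice on purpose: once for each argument inside the pipeline, and once for the pipeline as a whole, so instructions containing quotes or spaces survive both shell layers.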
The oh-my-opencode CLI requires the bun runtime, which was missing in the Docker container, causing setup to fail with 'bun: No such file or directory' (exit code 127).
Code Review by Sisyphus
Hi @code-yeongyu! I've completed a thorough review of PR #574. This is an excellent implementation of Harbor Terminal-Bench integration. The code is clean, well-structured, and follows solid engineering principles.
✅ Strengths
Architecture & Design:
Security & Robustness:
ATIF Conversion:
.gitignore Changes:
Remove nvm/node/npm - bun handles everything:
- bun install -g opencode-ai
- bunx oh-my-opencode install
Remove ATIF trajectory parsing - keep only essential:
- name(), install template path, run command
- 263 lines → 74 lines
Switch from --claude=yes to --claude=no for oh-my-opencode install. This configures all subagents to use the free opencode/zen provider instead of paid Anthropic API.
The opencode/zen provider doesn't require an API key, so add it to provider_keys with an empty list.
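A minimal sketch of what an empty key list for a keyless provider could mean in practice follows. The mapping contents, the provider ID `opencode`, the placeholder model name, and the helper `provider_env` are all assumptions for illustration; the PR's actual `provider_keys` structure is not shown on this page.

```python
from typing import Mapping

# Hypothetical mapping from provider ID to the environment variables it needs.
PROVIDER_KEYS: dict[str, list[str]] = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "opencode": [],  # assumed ID for opencode/zen: no API key required
}


def provider_env(model_name: str, environ: Mapping[str, str]) -> dict[str, str]:
    """Return only the environment variables the model's provider needs."""
    provider = model_name.split("/", 1)[0]
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return {
        key: environ[key]
        for key in PROVIDER_KEYS[provider]
        if key in environ
    }
```

With an explicit empty list, a keyless provider is treated as known and simply forwards nothing, which is distinct from an unknown provider that raises an error.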
Pre-create oh-my-opencode config before plugin install to prevent hook initialization during load, which can cause rate limits or unnecessary overhead in benchmark environments.
Summary
Add benchmark infrastructure to evaluate Sisyphus agent on Harbor Terminal-Bench.
- `InstalledAgent` that runs OpenCode CLI with `--agent Sisyphus` in headless mode
Test Results
Verified with `hello-world@1.0` dataset:
Usage
Files Changed
- `benchmark/sisyphus_agent.py`
- `benchmark/install-sisyphus.sh.j2`
- `benchmark/__init__.py`
- `.gitignore`: `__pycache__`, `jobs`
Summary by cubic
Adds Harbor Terminal-Bench integration for the Sisyphus agent, running OpenCode in headless mode with provider-aware setup. This enables consistent benchmarking across Harbor datasets.
New Features
Migration
Written for commit 8eee7ad. Summary will update on new commits.