🚧 [WIP] Add Terminal Bench benchmarking workflow #339
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,3 @@ | ||
| # Global code owner | ||
| * @localden | ||
| /benchmarks @adam-paterson |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| __pycache__/ | ||
| *.pyc | ||
| runs/ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| # Specify Terminal Bench Agent | ||
|
|
||
| This package provides Terminal Bench agents that drive the Spec -> Plan -> Tasks | ||
| workflow using the exact prompts and templates that ship with the Specify CLI. The | ||
| agents run outside the end-user CLI so benchmarking dependencies stay isolated. | ||
|
|
||
| ## Project layout | ||
|
|
||
| ``` | ||
| benchmarks/ | ||
| terminal_bench_agent/ | ||
| pyproject.toml # standalone uv project for benchmarking-only deps | ||
| README.md # this guide | ||
| specify_terminal_bench/ | ||
| __init__.py # package export | ||
| agent.py # workflow-aware agent definitions | ||
| prompt_templates/ # legacy prompt assets (unused by the new mixin) | ||
| ``` | ||
|
|
||
| ## Getting started | ||
|
|
||
| 1. Create an isolated environment for the benchmarking tools: | ||
| ```bash | ||
| cd benchmarks/terminal_bench_agent | ||
| uv sync | ||
| ``` | ||
| 2. (Optional) Export credentials for paid providers if you plan to use them | ||
| (e.g. `ANTHROPIC_API_KEY` for Claude Code). | ||
| 3. Run Terminal Bench with the OpenCode workflow agent and the public core dataset: | ||
| ```bash | ||
| uv run tb run \ | ||
| --dataset terminal-bench-core==head \ | ||
| --task-id hello-world \ | ||
| --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent | ||
| ``` | ||
| This defaults to the free `opencode/grok-code-fast-1` model. Provide | ||
| `--agent-kwarg model_name=<provider/model>` if you want another OpenCode target. | ||
| 4. To benchmark with Claude Code instead, switch the import path: | ||
| ```bash | ||
| uv run tb run \ | ||
| --dataset terminal-bench-core==head \ | ||
| --task-id hello-world \ | ||
| --agent-import-path specify_terminal_bench.agent:SpecifyClaudeWorkflowAgent \ | ||
| --agent-kwarg model_name=anthropic/claude-3-5-sonnet-20241022 | ||
| ``` | ||
|
|
||
| ## Customisation | ||
|
|
||
| - The agents assemble their prompts at runtime from the real Specify CLI sources: | ||
| `templates/commands/specify.md`, `plan.md`, `tasks.md` and their corresponding | ||
| templates. Update those files in the main repository to change benchmarking | ||
| behaviour. | ||
| - Pass additional keyword arguments through `--agent-kwarg` to reach provider-specific | ||
| options (e.g. `version=...`). | ||
| - If you need a different provider entirely, subclass the desired Terminal Bench agent | ||
| under `specify_terminal_bench/agent.py` and reuse `SpecifyWorkflowMixin`. | ||
|
|
||
| ## Tips | ||
|
|
||
| - Terminal Bench requires Python 3.12+. The dedicated project keeps this dependency | ||
| separate from the end-user CLI, which still targets Python 3.11. | ||
| - The agents read prompt assets from the repository root, so run benchmarks from the | ||
| root checkout. | ||
| - `uv run tb --help` lists additional switches (filtering tasks, resuming runs, etc.). |
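
The `--agent-import-path` values in the README follow a `module:Class` shape. As a rough illustration of how such a string can be resolved, here is a minimal sketch. It shows the general pattern only and is not Terminal Bench's actual loader; `load_from_import_path` is a hypothetical helper, and a standard-library class stands in for the agent so the snippet runs without `terminal-bench` installed.

```python
from importlib import import_module


def load_from_import_path(import_path: str):
    """Resolve a 'package.module:ClassName' string to the object it names."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"Expected 'module:attribute', got {import_path!r}")
    module = import_module(module_name)
    return getattr(module, attr_name)


# Standard-library stand-in (assumption: any importable 'module:Class' works the same way).
ordered_dict_cls = load_from_import_path("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```

Under this pattern, `specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent` would only resolve if the `specify_terminal_bench` package is importable in the environment that `uv run` activates.
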
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| [project] | ||
| name = "specify-terminal-bench-agent" | ||
| version = "0.1.0" | ||
| description = "Terminal Bench agent that applies the Specify spec-driven workflow" | ||
| requires-python = ">=3.12" | ||
| dependencies = [ | ||
| "terminal-bench>=0.2.17", | ||
| ] | ||
|
|
||
| [project.readme] | ||
| file = "README.md" | ||
| content-type = "text/markdown" | ||
|
|
||
| [build-system] | ||
| requires = ["hatchling"] | ||
| build-backend = "hatchling.build" | ||
|
|
||
| [tool.hatch.build.targets.wheel] | ||
| packages = ["specify_terminal_bench"] | ||
| include = ["specify_terminal_bench/**"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| from .agent import SpecifyClaudeWorkflowAgent, SpecifyOpenCodeWorkflowAgent | ||
|
|
||
| __all__ = ["SpecifyClaudeWorkflowAgent", "SpecifyOpenCodeWorkflowAgent"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,174 @@ | ||
| from __future__ import annotations | ||
|
|
||
| import os | ||
| import shlex | ||
| from functools import lru_cache | ||
| from pathlib import Path | ||
| from textwrap import dedent | ||
|
|
||
| from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import ( | ||
| ClaudeCodeAgent, | ||
| ) | ||
| from terminal_bench.agents.installed_agents.opencode.opencode_agent import ( | ||
| OpenCodeAgent, | ||
| ) | ||
| from terminal_bench.terminal.models import TerminalCommand | ||
|
|
||
|
|
||
| def _repo_root() -> Path: | ||
| """Return the root of the Spec Kit repository.""" | ||
|
|
||
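| # agent.py sits three directories below the repository root | ||
| # (benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py). | ||
| return Path(__file__).resolve().parents[3] | ||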
|
|
||
|
|
||
| def _read_text(path: Path) -> str: | ||
| try: | ||
| return path.read_text() | ||
| except FileNotFoundError as exc: # pragma: no cover - fail fast during benchmarks | ||
| raise RuntimeError( | ||
| f"Required prompt asset missing: {path}" | ||
| ) from exc | ||
|
|
||
|
|
||
| @lru_cache(maxsize=1) | ||
| def _prompt_assets() -> dict[str, str]: | ||
| """Load the canonical Spec -> Plan -> Tasks prompts and templates.""" | ||
|
|
||
| root = _repo_root() | ||
| return { | ||
| "spec_command": _read_text(root / "templates" / "commands" / "specify.md"), | ||
| "plan_command": _read_text(root / "templates" / "commands" / "plan.md"), | ||
| "tasks_command": _read_text(root / "templates" / "commands" / "tasks.md"), | ||
| "spec_template": _read_text(root / "templates" / "spec-template.md"), | ||
| "plan_template": _read_text(root / "templates" / "plan-template.md"), | ||
| "tasks_template": _read_text(root / "templates" / "tasks-template.md"), | ||
| } | ||
|
|
||
|
|
||
| def _build_workflow_prompt(instruction: str) -> str: | ||
| assets = _prompt_assets() | ||
|
|
||
| return dedent( | ||
| f""" | ||
| You are the Specify Spec Kit benchmarking agent. Your job is to apply the | ||
| Spec -> Plan -> Tasks workflow exactly as defined by the CLI prompts before | ||
| attempting any implementation work in the Terminal Bench task container. | ||
|
|
||
| Task instruction from Terminal Bench: | ||
| --- | ||
| {instruction.strip()} | ||
| --- | ||
|
|
||
| ## Workflow expectations | ||
| 1. Review repository context and any constitutions before acting. | ||
| 2. Produce SPECIFICATION, PLAN, and TASKS sections in that order using the | ||
| canonical prompts below. Do not start execution until all three are | ||
| drafted. | ||
| 3. After presenting the tasks list, print `BEGIN EXECUTION` and carry out the | ||
| tasks sequentially, announcing each task ID as you start and finish. | ||
| 4. Keep artefacts up to date as understanding evolves and run relevant tests | ||
| before concluding. | ||
|
|
||
| ## Canonical command prompts | ||
| These excerpts are copied directly from the Specify CLI. Use them verbatim when | ||
| constructing the SPECIFICATION, PLAN, and TASKS artefacts. | ||
|
|
||
| ### templates/commands/specify.md | ||
| ```markdown | ||
| {assets['spec_command'].strip()} | ||
| ``` | ||
|
|
||
| ### templates/commands/plan.md | ||
| ```markdown | ||
| {assets['plan_command'].strip()} | ||
| ``` | ||
|
|
||
| ### templates/commands/tasks.md | ||
| ```markdown | ||
| {assets['tasks_command'].strip()} | ||
| ``` | ||
|
|
||
| ## Canonical templates | ||
| Reference these structures while drafting the artefacts so they stay aligned | ||
| with the Specify CLI outputs. | ||
|
|
||
| ### templates/spec-template.md | ||
| ```markdown | ||
| {assets['spec_template'].strip()} | ||
| ``` | ||
|
|
||
| ### templates/plan-template.md | ||
| ```markdown | ||
| {assets['plan_template'].strip()} | ||
| ``` | ||
|
|
||
| ### templates/tasks-template.md | ||
| ```markdown | ||
| {assets['tasks_template'].strip()} | ||
| ``` | ||
|
|
||
| Proceed only after you have completed the SPECIFICATION, PLAN, and TASKS | ||
| sections above. Once `BEGIN EXECUTION` has been emitted, follow the plan to | ||
| completion or explain any blockers. | ||
| """ | ||
| ).strip() | ||
|
|
||
|
|
||
| class SpecifyWorkflowMixin: | ||
| """Override instruction rendering to include Spec Kit workflow guidance.""" | ||
|
|
||
| def _render_instruction(self, instruction: str) -> str: # type: ignore[override] | ||
| return _build_workflow_prompt(instruction) | ||
|
|
||
|
|
||
| class SpecifyClaudeWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent): | ||
| """Claude Code agent preconfigured with the Spec -> Plan -> Tasks workflow.""" | ||
|
|
||
| @staticmethod | ||
| def name() -> str: | ||
| return "specify_claude_workflow" | ||
|
|
||
| def __init__(self, model_name: str | None = None, *args, **kwargs): | ||
| super().__init__(model_name=model_name, *args, **kwargs) | ||
|
|
||
|
|
||
| class SpecifyOpenCodeWorkflowAgent(SpecifyWorkflowMixin, OpenCodeAgent): | ||
| """OpenCode agent that drives the Spec -> Plan -> Tasks workflow.""" | ||
|
|
||
| _DEFAULT_MODEL = "opencode/grok-code-fast-1" | ||
|
|
||
| @staticmethod | ||
| def name() -> str: | ||
| return "specify_opencode_workflow" | ||
|
|
||
| def __init__(self, model_name: str | None = None, *args, **kwargs): | ||
| super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs) | ||
|
|
||
| def _render_instruction(self, instruction: str) -> str: # type: ignore[override] | ||
| # OpenCode uses stored prompts via `opencode run --command specify`, so pass | ||
| # through the raw task instruction and let the CLI command wrap it. | ||
| return instruction | ||
|
|
||
| def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]: | ||
|
||
| escaped_instruction = shlex.quote(instruction) | ||
| return [ | ||
| TerminalCommand( | ||
| command=( | ||
| f"opencode --model {self._model_name} -p specify run --command specify {escaped_instruction}" | ||
| ), | ||
| min_timeout_sec=0.0, | ||
| max_timeout_sec=float("inf"), | ||
| block=True, | ||
| append_enter=True, | ||
| ), | ||
| ] | ||
|
|
||
| @property | ||
| def _env(self) -> dict[str, str]: # type: ignore[override] | ||
| if getattr(self, "_provider", None) == "opencode": | ||
| # OpenCode public models do not require credentials, but allow an | ||
| # override if the user exports OPENCODE_API_KEY. | ||
| env: dict[str, str] = {} | ||
| if "OPENCODE_API_KEY" in os.environ: | ||
| env["OPENCODE_API_KEY"] = os.environ["OPENCODE_API_KEY"] | ||
| return env | ||
| return super()._env | ||
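
`_repo_root()` relies on `Path.parents` indexing, so it is worth checking which index corresponds to the repository root for the layout described in the README. A minimal sketch, assuming a hypothetical checkout at `/work/spec-kit` (only the relative layout comes from the README):

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; the path below the checkout mirrors the README layout.
agent_file = PurePosixPath(
    "/work/spec-kit/benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py"
)

# parents[0] is the containing directory; each further index goes one level up.
for index in range(5):
    print(index, agent_file.parents[index])

repo_root = agent_file.parents[3]
```

With this layout, index 3 is the checkout root and index 4 is the directory that contains the checkout, where `templates/commands/specify.md` would not be found.
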
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,13 @@ | ||||||||||||||||||||||||||||||||||||||||
| #!/bin/bash | ||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||
| apt-get update | ||||||||||||||||||||||||||||||||||||||||
| apt-get install -y curl | ||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||
| curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash | ||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||
|
Comment on lines +6 to +7
|
||||||||||||||||||||||||||||||||||||||||
| curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash | |
| # Download nvm install script | |
| NVM_VERSION="v0.40.2" | |
| NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh" | |
| NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh" | |
| curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT" | |
| # Expected SHA256 checksum for nvm v0.40.2 install.sh | |
| EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum | |
| ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')" | |
| if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then | |
| echo "ERROR: Checksum verification failed for nvm install.sh!" | |
| echo "Expected: $EXPECTED_SHA256" | |
| echo "Actual: $ACTUAL_SHA256" | |
| exit 1 | |
| fi | |
| bash "$NVM_INSTALL_SCRIPT" |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,23 @@ | ||||||||||||||||||||
| #!/bin/bash | ||||||||||||||||||||
| set -euo pipefail | ||||||||||||||||||||
|
Comment on lines +1 to +2
|
||||||||||||||||||||
|
|
||||||||||||||||||||
| apt-get update | ||||||||||||||||||||
| apt-get install -y curl git python3 python3-venv | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # Install Node ecosystem for OpenCode CLI | ||||||||||||||||||||
| curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash | ||||||||||||||||||||
|
|
||||||||||||||||||||
|
Comment on lines +8 to +9
|
||||||||||||||||||||
| curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash | |
| NVM_VERSION="v0.40.2" | |
| NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh" | |
| NVM_INSTALL_SH="/tmp/nvm-install.sh" | |
| NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release | |
| curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH" | |
| echo "${NVM_INSTALL_SH_SHA256} $NVM_INSTALL_SH" | sha256sum -c - | |
| bash "$NVM_INSTALL_SH" |
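
Both suggested changes pin the installer to a known SHA-256 digest before running it. The same check can be sketched in Python for readers who want to verify a download outside the shell script. This is an illustration under assumptions: `sha256_matches` is a hypothetical helper and the payload is a stand-in, not the real nvm installer or its checksum.

```python
import hashlib
import hmac


def sha256_matches(payload: bytes, expected_hex: str) -> bool:
    """Return True when the SHA-256 digest of payload equals expected_hex."""
    actual_hex = hashlib.sha256(payload).hexdigest()
    # Normalise the pinned value, then compare in constant time.
    return hmac.compare_digest(actual_hex, expected_hex.strip().lower())


# Illustrative payload only; a real check would read the downloaded script's bytes.
script = b"echo hello\n"
pinned = hashlib.sha256(script).hexdigest()

print(sha256_matches(script, pinned))
print(sha256_matches(script + b"# tampered\n", pinned))
```

The pinned digest has to come from a trusted source recorded ahead of time; computing it from the file that was just downloaded, as the stand-in above does, proves nothing about integrity.
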
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| {# Specify workflow template used by SpecifyClaudeWorkflowAgent #} | ||
| You are the Specify CLI benchmarking agent. You are being evaluated on your ability to run | ||
| a disciplined specification-driven workflow inside a terminal-only environment. | ||
|
|
||
| Task instruction from the benchmark: | ||
| {{ instruction }} | ||
|
|
||
| Always follow the Spec -> Plan -> Tasks (SPT) workflow *before* writing or editing | ||
| production code. Keep output concise and actionable for terminal use. | ||
|
|
||
| Checklist before you begin coding: | ||
| 1. Inspect repository metadata (README, docs/, tests/) to understand context. | ||
| 2. If a constitution or non-negotiable guidelines file exists (COMMON PATHS: | ||
| `CONSTITUTION.md`, `/memory/constitution.md`, `.specify/constitution.md`), read it | ||
| and obey it throughout the session. | ||
| 3. Capture any missing information as questions rather than assumptions. | ||
|
|
||
| Deliver the following artefacts in your first response *in this order*: | ||
|
|
||
| SPECIFICATION | ||
| - Summarise the desired behaviour and the user impact in plain language. | ||
| - Highlight scope boundaries and success criteria. | ||
| - Record open questions as bullet points prefixed with `NEEDS CLARIFICATION:`. | ||
|
|
||
| PLAN | ||
| - Produce an ordered implementation strategy (5-10 steps). | ||
| - Note which files you expect to touch and why. | ||
| - Identify validation steps (tests, linters, sanity checks). | ||
|
|
||
| TASKS | ||
| - Emit a numbered task list (T001, T002, ...) in dependency order. | ||
| - Tag tasks that can run in parallel with `[P]`. | ||
| - Each task must describe the concrete change plus the command(s) you will run. | ||
|
|
||
| After you print the task list, explicitly write `BEGIN EXECUTION` and then carry out | ||
| the tasks sequentially. While executing: | ||
| - Announce the task ID when starting or finishing a task. | ||
| - Keep artefacts up to date (update spec/plan/tasks sections in your messages when the | ||
| understanding changes). | ||
| - Prefer small, reviewable commits; run repository tests when they exist. | ||
| - Exit paging programs (`less`, editors) immediately after retrieving the needed output. | ||
|
|
||
| Completion requirements: | ||
| - All tasks have been executed or intentionally skipped with justification. | ||
| - Tests relevant to the change have been run (and rerun after fixes). | ||
| - Final message contains: summary of changes, test evidence, any follow-up work. | ||
|
|
||
| If at any point requirements conflict with the constitution or repository tests fail, | ||
| stop progressing and explain what blocks you. |
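
The template asks for task IDs (`T001`, `T002`, ...), an optional `[P]` tag, and a `BEGIN EXECUTION` marker. To make that contract concrete, here is a small sketch that parses such output. The exact line shape (`T001 [P] summary`) is an assumption made for illustration, and `Task`, `TASK_LINE` and `parse_tasks` are hypothetical names that are not part of this PR.

```python
import re
from dataclasses import dataclass

# Assumed line shape: "T001 [P] summary text" with the [P] tag optional.
TASK_LINE = re.compile(r"^(?P<task_id>T\d{3})\s+(?P<parallel>\[P\]\s+)?(?P<summary>.+)$")


@dataclass
class Task:
    task_id: str
    parallel: bool
    summary: str


def parse_tasks(output: str) -> list[Task]:
    """Collect task lines that appear before the BEGIN EXECUTION marker."""
    tasks = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if line == "BEGIN EXECUTION":
            break
        match = TASK_LINE.match(line)
        if match:
            tasks.append(
                Task(match["task_id"], match["parallel"] is not None, match["summary"])
            )
    return tasks


sample = """TASKS
T001 Create failing test (pytest -k hello)
T002 [P] Update README (git diff)
BEGIN EXECUTION
T001 started
"""
for task in parse_tasks(sample):
    print(task)
```

A parser along these lines could let a benchmark harness confirm that the task list was emitted before execution began, which is the ordering the template requires.
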
The `SpecifyOpenCodeWorkflowAgent` overrides `_render_instruction` to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.
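
The effect described in this comment can be reproduced with plain Python classes. The sketch below uses hypothetical stand-ins (`BaseAgent`, `WorkflowMixin`, `MixinAgent`, `OverridingAgent`) rather than the real Terminal Bench classes, and only demonstrates method resolution order.

```python
class BaseAgent:
    """Hypothetical stand-in for a Terminal Bench agent base class."""

    def _render_instruction(self, instruction: str) -> str:
        return instruction


class WorkflowMixin:
    """Stand-in for SpecifyWorkflowMixin: wraps the instruction."""

    def _render_instruction(self, instruction: str) -> str:
        return f"[workflow] {instruction}"


class MixinAgent(WorkflowMixin, BaseAgent):
    """Inherits the mixin behaviour unchanged."""


class OverridingAgent(WorkflowMixin, BaseAgent):
    """Overrides the hook, so the mixin version is never called."""

    def _render_instruction(self, instruction: str) -> str:
        return instruction


print([cls.__name__ for cls in OverridingAgent.__mro__])
print(MixinAgent()._render_instruction("fix bug"))
print(OverridingAgent()._render_instruction("fix bug"))
```

Because the subclass defines the method itself, the mixin stays in the MRO but contributes nothing to that hook. That is the mismatch the comment points at, and one reason composition or an explicit opt-out flag on the mixin might read more clearly.
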