diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 27fe556c5..9b83d6163 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,2 +1,3 @@ # Global code owner * @localden +/benchmarks @adam-paterson diff --git a/README.md b/README.md index a919545c4..e5dd115c1 100644 --- a/README.md +++ b/README.md @@ -23,6 +23,7 @@ - [🎯 Experimental goals](#-experimental-goals) - [🔧 Prerequisites](#-prerequisites) - [📖 Learn more](#-learn-more) +- [📊 Benchmarking with Terminal Bench](#-benchmarking-with-terminal-bench) - [📋 Detailed process](#-detailed-process) - [🔍 Troubleshooting](#-troubleshooting) - [👥 Maintainers](#-maintainers) @@ -180,6 +181,26 @@ Our research and experimentation focus on: - **[Complete Spec-Driven Development Methodology](./spec-driven.md)** - Deep dive into the full process - **[Detailed Walkthrough](#-detailed-process)** - Step-by-step implementation guide +## 📊 Benchmarking with Terminal Bench + +Benchmark the Specify workflow without impacting end users by using the standalone +Terminal Bench agent that lives in `benchmarks/terminal_bench_agent`. The project is +managed with uv and keeps heavy benchmarking dependencies separate from the main CLI. + +```bash +cd benchmarks/terminal_bench_agent +uv sync +uv run tb run \ + --dataset terminal-bench-core==head \ + --task-id hello-world \ + --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent +``` + +Set provider credentials only if you switch to a paid model (for example export +`ANTHROPIC_API_KEY` before using the Claude workflow agent). See +[`benchmarks/terminal_bench_agent/README.md`](benchmarks/terminal_bench_agent/README.md) +for detailed options and overrides. 
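The `--agent-import-path` flag above takes a `module:Class` string. The general mechanics of resolving such a path can be sketched as follows (a sketch only, using a stdlib class as a stand-in target; the real flag is parsed inside Terminal Bench itself):

```python
import importlib


def resolve_import_path(path: str) -> type:
    # "module:Class" import paths: split on the colon, import the module,
    # then fetch the class attribute from it.
    module_name, _, class_name = path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)


# Stand-in target from the stdlib, purely to demonstrate the mechanics;
# a real run would pass specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent.
cls = resolve_import_path("collections:OrderedDict")
print(cls.__name__)  # OrderedDict
```

Checking that the path resolves this way before launching a run fails fast on typos instead of partway through a benchmark.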
+ --- ## 📋 Detailed process diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore new file mode 100644 index 000000000..a79612b6c --- /dev/null +++ b/benchmarks/.gitignore @@ -0,0 +1,3 @@ +__pycache__/ +*.pyc +runs/ diff --git a/benchmarks/terminal_bench_agent/README.md b/benchmarks/terminal_bench_agent/README.md new file mode 100644 index 000000000..20f519a4d --- /dev/null +++ b/benchmarks/terminal_bench_agent/README.md @@ -0,0 +1,64 @@ +# Specify Terminal Bench Agent + +This package provides Terminal Bench agents that drive the Spec -> Plan -> Tasks +workflow using the exact prompts and templates that ship with the Specify CLI. The +agents run outside the end-user CLI so benchmarking dependencies stay isolated. + +## Project layout + +``` +benchmarks/ + terminal_bench_agent/ + pyproject.toml # standalone uv project for benchmarking-only deps + README.md # this guide + specify_terminal_bench/ + __init__.py # package export + agent.py # workflow-aware agent definitions + prompt_templates/ # legacy prompt assets (unused by the new mixin) +``` + +## Getting started + +1. Create an isolated environment for the benchmarking tools: + ```bash + cd benchmarks/terminal_bench_agent + uv sync + ``` +2. (Optional) Export credentials for paid providers if you plan to use them + (e.g. `ANTHROPIC_API_KEY` for Claude Code). +3. Run Terminal Bench with the OpenCode workflow agent and the public core dataset: + ```bash + uv run tb run \ + --dataset terminal-bench-core==head \ + --task-id hello-world \ + --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent + ``` + This defaults to the free `opencode/grok-code-fast-1` model. Provide + `--agent-kwarg model_name=` if you want another OpenCode target. +4. 
To benchmark with Claude Code instead, switch the import path:
+   ```bash
+   uv run tb run \
+     --dataset terminal-bench-core==head \
+     --task-id hello-world \
+     --agent-import-path specify_terminal_bench.agent:SpecifyClaudeWorkflowAgent \
+     --agent-kwarg model_name=anthropic/claude-3-5-sonnet-20241022
+   ```
+
+## Customisation
+
+- The agents assemble their prompts at runtime from the real Specify CLI sources:
+  `templates/commands/specify.md`, `plan.md`, `tasks.md`, and their corresponding
+  templates. Update those files in the main repository to change benchmarking
+  behaviour.
+- Pass additional keyword arguments through `--agent-kwarg` to reach provider-specific
+  options (e.g. `version=...`).
+- If you need a different provider entirely, subclass the desired Terminal Bench agent
+  in `specify_terminal_bench/agent.py` and reuse `SpecifyWorkflowMixin`.
+
+## Tips
+
+- Terminal Bench requires Python 3.12+. The dedicated project keeps this dependency
+  separate from the end-user CLI, which still targets Python 3.11.
+- The agents read prompt assets from the repository root, so run benchmarks from the
+  root checkout.
+- `uv run tb --help` lists additional switches (filtering tasks, resuming runs, etc.).
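The subclassing pattern from the Customisation notes above can be sketched as follows. This is a minimal, self-contained illustration: `FakeProviderAgent` is a hypothetical stand-in for a real Terminal Bench agent class (such as `ClaudeCodeAgent`), and the prompt builder is stubbed out:

```python
def _build_workflow_prompt(instruction: str) -> str:
    # Stub for specify_terminal_bench.agent._build_workflow_prompt.
    return f"[Spec -> Plan -> Tasks workflow]\n{instruction}"


class SpecifyWorkflowMixin:
    """Override instruction rendering to include Spec Kit workflow guidance."""

    def _render_instruction(self, instruction: str) -> str:
        return _build_workflow_prompt(instruction)


class FakeProviderAgent:
    """Hypothetical stand-in for a Terminal Bench installed agent."""

    def _render_instruction(self, instruction: str) -> str:
        return instruction


class SpecifyFakeProviderWorkflowAgent(SpecifyWorkflowMixin, FakeProviderAgent):
    # The mixin must come first so its _render_instruction wins in the MRO.
    @staticmethod
    def name() -> str:
        return "specify_fakeprovider_workflow"


agent = SpecifyFakeProviderWorkflowAgent()
print(agent._render_instruction("hello-world"))
```

Because the mixin precedes the provider agent in the base-class list, Python's method resolution order picks up the workflow-aware `_render_instruction` while everything else (installation, command execution) comes from the provider agent.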
diff --git a/benchmarks/terminal_bench_agent/pyproject.toml b/benchmarks/terminal_bench_agent/pyproject.toml
new file mode 100644
index 000000000..983821509
--- /dev/null
+++ b/benchmarks/terminal_bench_agent/pyproject.toml
@@ -0,0 +1,20 @@
+[project]
+name = "specify-terminal-bench-agent"
+version = "0.1.0"
+description = "Terminal Bench agent that applies the Specify spec-driven workflow"
+requires-python = ">=3.12"
+dependencies = [
+    "terminal-bench>=0.2.17",
+]
+
+[project.readme]
+file = "README.md"
+content-type = "text/markdown"
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["specify_terminal_bench"]
+include = ["specify_terminal_bench/**"]
diff --git a/benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py b/benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py
new file mode 100644
index 000000000..fe1ed96a6
--- /dev/null
+++ b/benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py
@@ -0,0 +1,3 @@
+from .agent import SpecifyClaudeWorkflowAgent, SpecifyOpenCodeWorkflowAgent
+
+__all__ = ["SpecifyClaudeWorkflowAgent", "SpecifyOpenCodeWorkflowAgent"]
diff --git a/benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py b/benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py
new file mode 100644
index 000000000..dcfe211fc
--- /dev/null
+++ b/benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py
@@ -0,0 +1,177 @@
+from __future__ import annotations
+
+import os
+import shlex
+from functools import lru_cache
+from pathlib import Path
+from textwrap import dedent
+
+from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import (
+    ClaudeCodeAgent,
+)
+from terminal_bench.agents.installed_agents.opencode.opencode_agent import (
+    OpenCodeAgent,
+)
+from terminal_bench.terminal.models import TerminalCommand
+
+
+def _repo_root() -> Path:
+    """Return the root of the Spec Kit repository."""
+
+    # agent.py lives at benchmarks/terminal_bench_agent/specify_terminal_bench/,
+    # so the repository root is three levels above this file.
+    return Path(__file__).resolve().parents[3]
+
+
+def _read_text(path:
Path) -> str: + try: + return path.read_text() + except FileNotFoundError as exc: # pragma: no cover - fail fast during benchmarks + raise RuntimeError( + f"Required prompt asset missing: {path}" + ) from exc + + +@lru_cache(maxsize=1) +def _prompt_assets() -> dict[str, str]: + """Load the canonical Spec -> Plan -> Tasks prompts and templates.""" + + root = _repo_root() + return { + "spec_command": _read_text(root / "templates" / "commands" / "specify.md"), + "plan_command": _read_text(root / "templates" / "commands" / "plan.md"), + "tasks_command": _read_text(root / "templates" / "commands" / "tasks.md"), + "spec_template": _read_text(root / "templates" / "spec-template.md"), + "plan_template": _read_text(root / "templates" / "plan-template.md"), + "tasks_template": _read_text(root / "templates" / "tasks-template.md"), + } + + +def _build_workflow_prompt(instruction: str) -> str: + assets = _prompt_assets() + + return dedent( + f""" + You are the Specify Spec Kit benchmarking agent. Your job is to apply the + Spec -> Plan -> Tasks workflow exactly as defined by the CLI prompts before + attempting any implementation work in the Terminal Bench task container. + + Task instruction from Terminal Bench: + --- + {instruction.strip()} + --- + + ## Workflow expectations + 1. Review repository context and any constitutions before acting. + 2. Produce SPECIFICATION, PLAN, and TASKS sections in that order using the + canonical prompts below. Do not start execution until all three are + drafted. + 3. After presenting the tasks list, print `BEGIN EXECUTION` and carry out the + tasks sequentially, announcing each task ID as you start and finish. + 4. Keep artefacts up to date as understanding evolves and run relevant tests + before concluding. + + ## Canonical command prompts + These excerpts are copied directly from the Specify CLI. Use them verbatim when + constructing the SPECIFICATION, PLAN, and TASKS artefacts. 
+ + ### templates/commands/specify.md + ```markdown + {assets['spec_command'].strip()} + ``` + + ### templates/commands/plan.md + ```markdown + {assets['plan_command'].strip()} + ``` + + ### templates/commands/tasks.md + ```markdown + {assets['tasks_command'].strip()} + ``` + + ## Canonical templates + Reference these structures while drafting the artefacts so they stay aligned + with the Specify CLI outputs. + + ### templates/spec-template.md + ```markdown + {assets['spec_template'].strip()} + ``` + + ### templates/plan-template.md + ```markdown + {assets['plan_template'].strip()} + ``` + + ### templates/tasks-template.md + ```markdown + {assets['tasks_template'].strip()} + ``` + + Proceed only after you have completed the SPECIFICATION, PLAN, and TASKS + sections above. Once `BEGIN EXECUTION` has been emitted, follow the plan to + completion or explain any blockers. + """ + ).strip() + + +class SpecifyWorkflowMixin: + """Override instruction rendering to include Spec Kit workflow guidance.""" + + def _render_instruction(self, instruction: str) -> str: # type: ignore[override] + return _build_workflow_prompt(instruction) + + +class SpecifyClaudeWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent): + """Claude Code agent preconfigured with the Spec -> Plan -> Tasks workflow.""" + + @staticmethod + def name() -> str: + return "specify_claude_workflow" + + def __init__(self, model_name: str | None = None, *args, **kwargs): + super().__init__(model_name=model_name, *args, **kwargs) + + +class SpecifyOpenCodeWorkflowAgent(SpecifyWorkflowMixin, OpenCodeAgent): + """OpenCode agent that drives the Spec -> Plan -> Tasks workflow.""" + + _DEFAULT_MODEL = "opencode/grok-code-fast-1" + + @staticmethod + def name() -> str: + return "specify_opencode_workflow" + + def __init__(self, model_name: str | None = None, *args, **kwargs): + super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs) + + def _render_instruction(self, instruction: str) -> str: # type: 
ignore[override] + # OpenCode uses stored prompts via `opencode run --command specify`, so pass + # through the raw task instruction and let the CLI command wrap it. + return instruction + + def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]: + escaped_instruction = shlex.quote(instruction) + return [ + TerminalCommand( + command=( + f"opencode --model {self._model_name} -p specify run --command specify {escaped_instruction}" + ), + min_timeout_sec=0.0, + max_timeout_sec=float("inf"), + block=True, + append_enter=True, + ), + ] + + @property + def _env(self) -> dict[str, str]: # type: ignore[override] + if getattr(self, "_provider", None) == "opencode": + # OpenCode public models do not require credentials, but allow an + # override if the user exports OPENCODE_API_KEY. + env: dict[str, str] = {} + if "OPENCODE_API_KEY" in os.environ: + env["OPENCODE_API_KEY"] = os.environ["OPENCODE_API_KEY"] + return env + return super()._env diff --git a/benchmarks/terminal_bench_agent/specify_terminal_bench/claude-code-setup.sh.j2 b/benchmarks/terminal_bench_agent/specify_terminal_bench/claude-code-setup.sh.j2 new file mode 100644 index 000000000..aa567c198 --- /dev/null +++ b/benchmarks/terminal_bench_agent/specify_terminal_bench/claude-code-setup.sh.j2 @@ -0,0 +1,13 @@ +#!/bin/bash + +apt-get update +apt-get install -y curl + +curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash + +source "$HOME/.nvm/nvm.sh" + +nvm install 22 +npm -v + +npm install -g @anthropic-ai/claude-code@{{ version }} diff --git a/benchmarks/terminal_bench_agent/specify_terminal_bench/opencode-setup.sh.j2 b/benchmarks/terminal_bench_agent/specify_terminal_bench/opencode-setup.sh.j2 new file mode 100644 index 000000000..a3e276763 --- /dev/null +++ b/benchmarks/terminal_bench_agent/specify_terminal_bench/opencode-setup.sh.j2 @@ -0,0 +1,23 @@ +#!/bin/bash +set -euo pipefail + +apt-get update +apt-get install -y curl git python3 python3-venv + +# Install Node 
ecosystem for OpenCode CLI +curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash + +source "$HOME/.nvm/nvm.sh" + +nvm install 22 +npm -v + +npm i -g opencode-ai@{{ version }} + +# Install uv for Specify CLI bootstrap +curl -LsSf https://astral.sh/uv/install.sh | sh +export PATH="$HOME/.local/bin:$PATH" + +# Bootstrap Specify prompts inside the task repository +cd /app +uvx --from git+https://github.com/github/spec-kit.git specify init --no-git --ai opencode --script sh --ignore-agent-tools task-specification diff --git a/benchmarks/terminal_bench_agent/specify_terminal_bench/prompt_templates/specify_workflow.j2 b/benchmarks/terminal_bench_agent/specify_terminal_bench/prompt_templates/specify_workflow.j2 new file mode 100644 index 000000000..fe470ccca --- /dev/null +++ b/benchmarks/terminal_bench_agent/specify_terminal_bench/prompt_templates/specify_workflow.j2 @@ -0,0 +1,49 @@ +{# Specify workflow template used by SpecifyClaudeWorkflowAgent #} +You are the Specify CLI benchmarking agent. You are being evaluated on your ability to run +a disciplined specification-driven workflow inside a terminal-only environment. + +Task instruction from the benchmark: +{{ instruction }} + +Always follow the Spec -> Plan -> Tasks (SPT) workflow *before* writing or editing +production code. Keep output concise and actionable for terminal use. + +Checklist before you begin coding: +1. Inspect repository metadata (README, docs/, tests/) to understand context. +2. If a constitution or non-negotiable guidelines file exists (COMMON PATHS: + `CONSTITUTION.md`, `/memory/constitution.md`, `.specify/constitution.md`), read it + and obey it throughout the session. +3. Capture any missing information as questions rather than assumptions. + +Deliver the following artefacts in your first response *in this order*: + +SPECIFICATION +- Summarise the desired behaviour and the user impact in plain language. +- Highlight scope boundaries and success criteria. 
+- Record open questions as bullet points prefixed with `NEEDS CLARIFICATION:`. + +PLAN +- Produce an ordered implementation strategy (5-10 steps). +- Note which files you expect to touch and why. +- Identify validation steps (tests, linters, sanity checks). + +TASKS +- Emit a numbered task list (T001, T002, ...) in dependency order. +- Tag tasks that can run in parallel with `[P]`. +- Each task must describe the concrete change plus the command(s) you will run. + +After you print the task list, explicitly write `BEGIN EXECUTION` and then carry out +the tasks sequentially. While executing: +- Announce the task ID when starting or finishing a task. +- Keep artefacts up to date (update spec/plan/tasks sections in your messages when the + understanding changes). +- Prefer small, reviewable commits; run repository tests when they exist. +- Exit paging programs (`less`, editors) immediately after retrieving the needed output. + +Completion requirements: +- All tasks have been executed or intentionally skipped with justification. +- Tests relevant to the change have been run (and rerun after fixes). +- Final message contains: summary of changes, test evidence, any follow-up work. + +If at any point requirements conflict with the constitution or repository tests fail, +stop progressing and explain what blocks you.
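For concreteness, a hypothetical (heavily abbreviated) first response following the artefact format the template demands might look like:

```markdown
SPECIFICATION
- Add a `hello` CLI command that prints a greeting; success = existing tests still pass.
- NEEDS CLARIFICATION: should the greeting be localised?

PLAN
1. Inspect the existing CLI entry points.
2. Implement the `hello` command.
3. Add a unit test and run the test suite.

TASKS
T001 Add the `hello` command in the CLI module; verify by invoking it.
T002 [P] Add a unit test for the greeting; run the test file.

BEGIN EXECUTION
```

All names above (`hello`, the task descriptions) are invented for illustration; only the section headings, the `NEEDS CLARIFICATION:` prefix, the `T001`/`[P]` task tags, and the `BEGIN EXECUTION` marker are prescribed by the template.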