1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -1,2 +1,3 @@
# Global code owner
* @localden
/benchmarks @adam-paterson
21 changes: 21 additions & 0 deletions README.md
@@ -23,6 +23,7 @@
- [🎯 Experimental goals](#-experimental-goals)
- [🔧 Prerequisites](#-prerequisites)
- [📖 Learn more](#-learn-more)
- [📊 Benchmarking with Terminal Bench](#-benchmarking-with-terminal-bench)
- [📋 Detailed process](#-detailed-process)
- [🔍 Troubleshooting](#-troubleshooting)
- [👥 Maintainers](#-maintainers)
@@ -180,6 +181,26 @@ Our research and experimentation focus on:
- **[Complete Spec-Driven Development Methodology](./spec-driven.md)** - Deep dive into the full process
- **[Detailed Walkthrough](#-detailed-process)** - Step-by-step implementation guide

## 📊 Benchmarking with Terminal Bench

Benchmark the Specify workflow without impacting end users by using the standalone
Terminal Bench agent that lives in `benchmarks/terminal_bench_agent`. The project is
managed with uv and keeps heavy benchmarking dependencies separate from the main CLI.

```bash
cd benchmarks/terminal_bench_agent
uv sync
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
```

Set provider credentials only if you switch to a paid model (for example, export
`ANTHROPIC_API_KEY` before using the Claude workflow agent). See
[`benchmarks/terminal_bench_agent/README.md`](benchmarks/terminal_bench_agent/README.md)
for detailed options and overrides.

---

## 📋 Detailed process
3 changes: 3 additions & 0 deletions benchmarks/.gitignore
@@ -0,0 +1,3 @@
__pycache__/
*.pyc
runs/
64 changes: 64 additions & 0 deletions benchmarks/terminal_bench_agent/README.md
@@ -0,0 +1,64 @@
# Specify Terminal Bench Agent

This package provides Terminal Bench agents that drive the Spec -> Plan -> Tasks
workflow using the exact prompts and templates that ship with the Specify CLI. The
agents run outside the end-user CLI so benchmarking dependencies stay isolated.

## Project layout

```
benchmarks/
terminal_bench_agent/
pyproject.toml # standalone uv project for benchmarking-only deps
README.md # this guide
specify_terminal_bench/
__init__.py # package export
agent.py # workflow-aware agent definitions
prompt_templates/ # legacy prompt assets (unused by the new mixin)
```

## Getting started

1. Create an isolated environment for the benchmarking tools:
```bash
cd benchmarks/terminal_bench_agent
uv sync
```
2. (Optional) Export credentials for paid providers if you plan to use them
(e.g. `ANTHROPIC_API_KEY` for Claude Code).
3. Run Terminal Bench with the OpenCode workflow agent and the public core dataset:
```bash
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
```
This defaults to the free `opencode/grok-code-fast-1` model. Provide
`--agent-kwarg model_name=<provider/model>` if you want another OpenCode target.
4. To benchmark with Claude Code instead, switch the import path:
```bash
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyClaudeWorkflowAgent \
--agent-kwarg model_name=anthropic/claude-3-5-sonnet-20241022
```

## Customisation

- The agents assemble their prompts at runtime from the real Specify CLI sources:
`templates/commands/specify.md`, `plan.md`, `tasks.md` and their corresponding
templates. Update those files in the main repository to change benchmarking
behaviour.
- Pass additional keyword arguments through `--agent-kwarg` to reach provider-specific
  options (e.g. `version=...`).
- If you need a different provider entirely, subclass the desired Terminal Bench agent
  in `specify_terminal_bench/agent.py` and reuse `SpecifyWorkflowMixin`, as sketched below.
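
A minimal sketch of that subclassing pattern, assuming the Claude Code base class that `agent.py` already imports; the class name, agent name, and default model below are illustrative only:

```python
# Hypothetical example, not shipped with this package.
from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import (
    ClaudeCodeAgent,
)

from specify_terminal_bench.agent import SpecifyWorkflowMixin


class MyProviderWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent):
    """Applies the Spec -> Plan -> Tasks prompt while targeting another model."""

    @staticmethod
    def name() -> str:
        return "my_provider_workflow"  # illustrative agent name

    def __init__(self, model_name: str | None = None, *args, **kwargs):
        # Illustrative default; override with --agent-kwarg model_name=...
        super().__init__(
            model_name=model_name or "anthropic/claude-3-5-sonnet-20241022",
            *args,
            **kwargs,
        )
```

Point Terminal Bench at such a class with `--agent-import-path <your_module>:MyProviderWorkflowAgent`.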

## Tips

- Terminal Bench requires Python 3.12+. The dedicated project keeps this dependency
  separate from the end-user CLI, which still targets Python 3.11.
- The agents read prompt assets from the repository root, so run benchmarks from the
root checkout.
- `uv run tb --help` lists additional switches (filtering tasks, resuming runs, etc.).
20 changes: 20 additions & 0 deletions benchmarks/terminal_bench_agent/pyproject.toml
@@ -0,0 +1,20 @@
[project]
name = "specify-terminal-bench-agent"
version = "0.1.0"
description = "Terminal Bench agent that applies the Specify spec-driven workflow"
requires-python = ">=3.12"
dependencies = [
"terminal-bench>=0.2.17",
]

[project.readme]
file = "README.md"
content-type = "text/markdown"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["specify_terminal_bench"]
include = ["specify_terminal_bench/**"]
3 changes: 3 additions & 0 deletions benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py
@@ -0,0 +1,3 @@
from .agent import SpecifyClaudeWorkflowAgent, SpecifyOpenCodeWorkflowAgent

__all__ = ["SpecifyClaudeWorkflowAgent", "SpecifyOpenCodeWorkflowAgent"]
174 changes: 174 additions & 0 deletions benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py
@@ -0,0 +1,174 @@
from __future__ import annotations

import os
import shlex
from functools import lru_cache
from pathlib import Path
from textwrap import dedent

from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import (
ClaudeCodeAgent,
)
from terminal_bench.agents.installed_agents.opencode.opencode_agent import (
OpenCodeAgent,
)


def _repo_root() -> Path:
"""Return the root of the Spec Kit repository."""

return Path(__file__).resolve().parents[4]


def _read_text(path: Path) -> str:
try:
return path.read_text()
except FileNotFoundError as exc: # pragma: no cover - fail fast during benchmarks
raise RuntimeError(
f"Required prompt asset missing: {path}"
) from exc


@lru_cache(maxsize=1)
def _prompt_assets() -> dict[str, str]:
"""Load the canonical Spec -> Plan -> Tasks prompts and templates."""

root = _repo_root()
return {
"spec_command": _read_text(root / "templates" / "commands" / "specify.md"),
"plan_command": _read_text(root / "templates" / "commands" / "plan.md"),
"tasks_command": _read_text(root / "templates" / "commands" / "tasks.md"),
"spec_template": _read_text(root / "templates" / "spec-template.md"),
"plan_template": _read_text(root / "templates" / "plan-template.md"),
"tasks_template": _read_text(root / "templates" / "tasks-template.md"),
}


def _build_workflow_prompt(instruction: str) -> str:
assets = _prompt_assets()

return dedent(
f"""
You are the Specify Spec Kit benchmarking agent. Your job is to apply the
Spec -> Plan -> Tasks workflow exactly as defined by the CLI prompts before
attempting any implementation work in the Terminal Bench task container.

Task instruction from Terminal Bench:
---
{instruction.strip()}
---

## Workflow expectations
1. Review repository context and any constitutions before acting.
2. Produce SPECIFICATION, PLAN, and TASKS sections in that order using the
canonical prompts below. Do not start execution until all three are
drafted.
3. After presenting the tasks list, print `BEGIN EXECUTION` and carry out the
tasks sequentially, announcing each task ID as you start and finish.
4. Keep artefacts up to date as understanding evolves and run relevant tests
before concluding.

## Canonical command prompts
These excerpts are copied directly from the Specify CLI. Use them verbatim when
constructing the SPECIFICATION, PLAN, and TASKS artefacts.

### templates/commands/specify.md
```markdown
{assets['spec_command'].strip()}
```

### templates/commands/plan.md
```markdown
{assets['plan_command'].strip()}
```

### templates/commands/tasks.md
```markdown
{assets['tasks_command'].strip()}
```

## Canonical templates
Reference these structures while drafting the artefacts so they stay aligned
with the Specify CLI outputs.

### templates/spec-template.md
```markdown
{assets['spec_template'].strip()}
```

### templates/plan-template.md
```markdown
{assets['plan_template'].strip()}
```

### templates/tasks-template.md
```markdown
{assets['tasks_template'].strip()}
```

Proceed only after you have completed the SPECIFICATION, PLAN, and TASKS
sections above. Once `BEGIN EXECUTION` has been emitted, follow the plan to
completion or explain any blockers.
"""
).strip()


class SpecifyWorkflowMixin:
"""Override instruction rendering to include Spec Kit workflow guidance."""

def _render_instruction(self, instruction: str) -> str: # type: ignore[override]
return _build_workflow_prompt(instruction)


class SpecifyClaudeWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent):
"""Claude Code agent preconfigured with the Spec -> Plan -> Tasks workflow."""

@staticmethod
def name() -> str:
return "specify_claude_workflow"

def __init__(self, model_name: str | None = None, *args, **kwargs):
super().__init__(model_name=model_name, *args, **kwargs)


class SpecifyOpenCodeWorkflowAgent(SpecifyWorkflowMixin, OpenCodeAgent):
"""OpenCode agent that drives the Spec -> Plan -> Tasks workflow."""

_DEFAULT_MODEL = "opencode/grok-code-fast-1"

@staticmethod
def name() -> str:
return "specify_opencode_workflow"

def __init__(self, model_name: str | None = None, *args, **kwargs):
super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs)

def _render_instruction(self, instruction: str) -> str: # type: ignore[override]
Copilot AI (Sep 18, 2025): The SpecifyOpenCodeWorkflowAgent overrides _render_instruction to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.

Suggested change:
-    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]
+    def render_instruction(self, instruction: str) -> str:
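
One way to read that suggestion is sketched below with hypothetical names (`SpecifyPromptBuilder`, `ComposedOpenCodeAgent`): inject the prompt strategy instead of inheriting it. That `_render_instruction` is the right hook is an assumption carried over from the Terminal Bench base classes.

```python
# Hypothetical sketch of the composition alternative, not part of this PR.
from terminal_bench.agents.installed_agents.opencode.opencode_agent import (
    OpenCodeAgent,
)

from specify_terminal_bench.agent import _build_workflow_prompt


class SpecifyPromptBuilder:
    """Explicit collaborator that wraps the workflow prompt assembly."""

    def render(self, instruction: str) -> str:
        return _build_workflow_prompt(instruction)


class ComposedOpenCodeAgent(OpenCodeAgent):
    """Receives its prompt strategy instead of inheriting it from a mixin."""

    def __init__(self, *args, prompt_builder: SpecifyPromptBuilder | None = None, **kwargs):
        super().__init__(*args, **kwargs)
        # prompt_builder=None makes the raw-instruction path an explicit
        # choice rather than a silent method override.
        self._prompt_builder = prompt_builder

    def _render_instruction(self, instruction: str) -> str:
        if self._prompt_builder is None:
            return instruction
        return self._prompt_builder.render(instruction)
```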
# OpenCode uses stored prompts via `opencode run --command specify`, so pass
# through the raw task instruction and let the CLI command wrap it.
return instruction

def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]:
Copilot AI (Sep 18, 2025): Missing import for TerminalCommand. This class is used in the return type annotation and instantiated on line 154, but it's not imported at the top of the file.
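
A plausible fix is a top-of-file import; the exact module path below is an assumption about the terminal-bench package layout, not something this diff confirms:

```python
# Assumed import path for TerminalCommand; verify against the installed
# terminal-bench version before relying on it.
from terminal_bench.terminal.models import TerminalCommand
```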
escaped_instruction = shlex.quote(instruction)
return [
TerminalCommand(
command=(
f"opencode --model {self._model_name} -p specify run --command specify {escaped_instruction}"
),
min_timeout_sec=0.0,
max_timeout_sec=float("inf"),
block=True,
append_enter=True,
),
]

@property
def _env(self) -> dict[str, str]: # type: ignore[override]
if getattr(self, "_provider", None) == "opencode":
# OpenCode public models do not require credentials, but allow an
# override if the user exports OPENCODE_API_KEY.
env: dict[str, str] = {}
if "OPENCODE_API_KEY" in os.environ:
env["OPENCODE_API_KEY"] = os.environ["OPENCODE_API_KEY"]
return env
return super()._env
13 changes: 13 additions & 0 deletions (Claude Code setup script)
@@ -0,0 +1,13 @@
#!/bin/bash

apt-get update
apt-get install -y curl

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Comment on lines +6 to +7 (Copilot AI, Sep 18, 2025): Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change:
-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+# Download nvm install script
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh"
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT"
+# Expected SHA256 checksum for nvm v0.40.2 install.sh
+EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum
+ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')"
+if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
+    echo "ERROR: Checksum verification failed for nvm install.sh!"
+    echo "Expected: $EXPECTED_SHA256"
+    echo "Actual: $ACTUAL_SHA256"
+    exit 1
+fi
+bash "$NVM_INSTALL_SCRIPT"
source "$HOME/.nvm/nvm.sh"

nvm install 22
npm -v

npm install -g @anthropic-ai/claude-code@{{ version }}
23 changes: 23 additions & 0 deletions (OpenCode setup script)
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail
Comment on lines +1 to +2 (Copilot AI, Sep 18, 2025): The script uses set -euo pipefail but line 8 pipes curl output to bash, which could mask curl failures. Consider using intermediate error checking or separating the download and execution steps.

apt-get update
apt-get install -y curl git python3 python3-venv

# Install Node ecosystem for OpenCode CLI
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Comment on lines +8 to +9 (Copilot AI, Sep 18, 2025): Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change:
-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SH="/tmp/nvm-install.sh"
+NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH"
+echo "${NVM_INSTALL_SH_SHA256} $NVM_INSTALL_SH" | sha256sum -c -
+bash "$NVM_INSTALL_SH"
source "$HOME/.nvm/nvm.sh"

nvm install 22
npm -v

npm i -g opencode-ai@{{ version }}

# Install uv for Specify CLI bootstrap
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Bootstrap Specify prompts inside the task repository
cd /app
uvx --from git+https://github.com/github/spec-kit.git specify init --no-git --ai opencode --script sh --ignore-agent-tools task-specification
49 changes: 49 additions & 0 deletions (Claude workflow prompt template)
@@ -0,0 +1,49 @@
{# Specify workflow template used by SpecifyClaudeWorkflowAgent #}
You are the Specify CLI benchmarking agent. You are being evaluated on your ability to run
a disciplined specification-driven workflow inside a terminal-only environment.

Task instruction from the benchmark:
{{ instruction }}

Always follow the Spec -> Plan -> Tasks (SPT) workflow *before* writing or editing
production code. Keep output concise and actionable for terminal use.

Checklist before you begin coding:
1. Inspect repository metadata (README, docs/, tests/) to understand context.
2. If a constitution or non-negotiable guidelines file exists (COMMON PATHS:
`CONSTITUTION.md`, `/memory/constitution.md`, `.specify/constitution.md`), read it
and obey it throughout the session.
3. Capture any missing information as questions rather than assumptions.

Deliver the following artefacts in your first response *in this order*:

SPECIFICATION
- Summarise the desired behaviour and the user impact in plain language.
- Highlight scope boundaries and success criteria.
- Record open questions as bullet points prefixed with `NEEDS CLARIFICATION:`.

PLAN
- Produce an ordered implementation strategy (5-10 steps).
- Note which files you expect to touch and why.
- Identify validation steps (tests, linters, sanity checks).

TASKS
- Emit a numbered task list (T001, T002, ...) in dependency order.
- Tag tasks that can run in parallel with `[P]`.
- Each task must describe the concrete change plus the command(s) you will run.

After you print the task list, explicitly write `BEGIN EXECUTION` and then carry out
the tasks sequentially. While executing:
- Announce the task ID when starting or finishing a task.
- Keep artefacts up to date (update spec/plan/tasks sections in your messages when the
understanding changes).
- Prefer small, reviewable commits; run repository tests when they exist.
- Exit paging programs (`less`, editors) immediately after retrieving the needed output.

Completion requirements:
- All tasks have been executed or intentionally skipped with justification.
- Tests relevant to the change have been run (and rerun after fixes).
- Final message contains: summary of changes, test evidence, any follow-up work.

If at any point requirements conflict with the constitution or repository tests fail,
stop progressing and explain what blocks you.