1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -1,2 +1,3 @@
# Global code owner
* @localden
/benchmarks @adam-paterson
21 changes: 21 additions & 0 deletions README.md
@@ -23,6 +23,7 @@
- [🎯 Experimental goals](#-experimental-goals)
- [🔧 Prerequisites](#-prerequisites)
- [📖 Learn more](#-learn-more)
- [📊 Benchmarking with Terminal Bench](#-benchmarking-with-terminal-bench)
- [📋 Detailed process](#-detailed-process)
- [🔍 Troubleshooting](#-troubleshooting)
- [👥 Maintainers](#-maintainers)
@@ -180,6 +181,26 @@ Our research and experimentation focus on:
- **[Complete Spec-Driven Development Methodology](./spec-driven.md)** - Deep dive into the full process
- **[Detailed Walkthrough](#-detailed-process)** - Step-by-step implementation guide

## 📊 Benchmarking with Terminal Bench

Benchmark the Specify workflow without impacting end users by using the standalone
Terminal Bench agent that lives in `benchmarks/terminal_bench_agent`. The project is
managed with uv and keeps heavy benchmarking dependencies separate from the main CLI.

```bash
cd benchmarks/terminal_bench_agent
uv sync
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
```

Set provider credentials only if you switch to a paid model (for example, export
`ANTHROPIC_API_KEY` before using the Claude workflow agent). See
[`benchmarks/terminal_bench_agent/README.md`](benchmarks/terminal_bench_agent/README.md)
for detailed options and overrides.

---

## 📋 Detailed process
3 changes: 3 additions & 0 deletions benchmarks/.gitignore
@@ -0,0 +1,3 @@
__pycache__/
*.pyc
runs/
64 changes: 64 additions & 0 deletions benchmarks/terminal_bench_agent/README.md
@@ -0,0 +1,64 @@
# Specify Terminal Bench Agent

This package provides Terminal Bench agents that drive the Spec -> Plan -> Tasks
workflow using the exact prompts and templates that ship with the Specify CLI. The
agents run outside the end-user CLI so benchmarking dependencies stay isolated.

## Project layout

```
benchmarks/
terminal_bench_agent/
pyproject.toml # standalone uv project for benchmarking-only deps
README.md # this guide
specify_terminal_bench/
__init__.py # package export
agent.py # workflow-aware agent definitions
prompt_templates/ # legacy prompt assets (unused by the new mixin)
```

## Getting started

1. Create an isolated environment for the benchmarking tools:
```bash
cd benchmarks/terminal_bench_agent
uv sync
```
2. (Optional) Export credentials for paid providers if you plan to use them
(e.g. `ANTHROPIC_API_KEY` for Claude Code).
3. Run Terminal Bench with the OpenCode workflow agent and the public core dataset:
```bash
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
```
This defaults to the free `opencode/grok-code-fast-1` model. Provide
`--agent-kwarg model_name=<provider/model>` if you want another OpenCode target.
4. To benchmark with Claude Code instead, switch the import path:
```bash
uv run tb run \
--dataset terminal-bench-core==head \
--task-id hello-world \
--agent-import-path specify_terminal_bench.agent:SpecifyClaudeWorkflowAgent \
--agent-kwarg model_name=anthropic/claude-3-5-sonnet-20241022
```

## Customisation

- The agents assemble their prompts at runtime from the real Specify CLI sources:
`templates/commands/specify.md`, `plan.md`, `tasks.md` and their corresponding
templates. Update those files in the main repository to change benchmarking
behaviour.
- Pass additional keyword arguments through `--agent-kwarg` to reach provider-specific
  options (e.g. `version=...`).
- If you need a different provider entirely, subclass the desired Terminal Bench agent
  in `specify_terminal_bench/agent.py` and reuse `SpecifyWorkflowMixin`, as sketched below.
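
A minimal sketch of that subclassing pattern, assuming the Claude Code base class that `agent.py` already imports; the class name, agent name, and default model below are illustrative only:

```python
# Hypothetical example, not shipped with this package.
from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import (
    ClaudeCodeAgent,
)

from specify_terminal_bench.agent import SpecifyWorkflowMixin


class MyProviderWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent):
    """Applies the Spec -> Plan -> Tasks prompt while targeting another model."""

    @staticmethod
    def name() -> str:
        return "my_provider_workflow"  # illustrative agent name

    def __init__(self, model_name: str | None = None, *args, **kwargs):
        # Illustrative default; override with --agent-kwarg model_name=...
        super().__init__(
            model_name=model_name or "anthropic/claude-3-5-sonnet-20241022",
            *args,
            **kwargs,
        )
```

Point Terminal Bench at such a class with `--agent-import-path <your_module>:MyProviderWorkflowAgent`.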

## Tips

- Terminal Bench requires Python 3.12+. The dedicated project keeps this dependency
  separate from the end-user CLI, which still targets Python 3.11.
- The agents read prompt assets from the repository root, so run benchmarks from the
root checkout.
- `uv run tb --help` lists additional switches (filtering tasks, resuming runs, etc.).
20 changes: 20 additions & 0 deletions benchmarks/terminal_bench_agent/pyproject.toml
@@ -0,0 +1,20 @@
[project]
name = "specify-terminal-bench-agent"
version = "0.1.0"
description = "Terminal Bench agent that applies the Specify spec-driven workflow"
requires-python = ">=3.12"
dependencies = [
"terminal-bench>=0.2.17",
]

[project.readme]
file = "README.md"
content-type = "text/markdown"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["specify_terminal_bench"]
include = ["specify_terminal_bench/**"]
3 changes: 3 additions & 0 deletions benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py
@@ -0,0 +1,3 @@
from .agent import SpecifyClaudeWorkflowAgent, SpecifyOpenCodeWorkflowAgent

__all__ = ["SpecifyClaudeWorkflowAgent", "SpecifyOpenCodeWorkflowAgent"]
174 changes: 174 additions & 0 deletions benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py
@@ -0,0 +1,174 @@
from __future__ import annotations

import os
import shlex
from functools import lru_cache
from pathlib import Path
from textwrap import dedent

from terminal_bench.agents.installed_agents.claude_code.claude_code_agent import (
ClaudeCodeAgent,
)
from terminal_bench.agents.installed_agents.opencode.opencode_agent import (
OpenCodeAgent,
)


def _repo_root() -> Path:
"""Return the root of the Spec Kit repository."""

return Path(__file__).resolve().parents[4]


def _read_text(path: Path) -> str:
try:
return path.read_text()
except FileNotFoundError as exc: # pragma: no cover - fail fast during benchmarks
raise RuntimeError(
f"Required prompt asset missing: {path}"
) from exc


@lru_cache(maxsize=1)
def _prompt_assets() -> dict[str, str]:
"""Load the canonical Spec -> Plan -> Tasks prompts and templates."""

root = _repo_root()
return {
"spec_command": _read_text(root / "templates" / "commands" / "specify.md"),
"plan_command": _read_text(root / "templates" / "commands" / "plan.md"),
"tasks_command": _read_text(root / "templates" / "commands" / "tasks.md"),
"spec_template": _read_text(root / "templates" / "spec-template.md"),
"plan_template": _read_text(root / "templates" / "plan-template.md"),
"tasks_template": _read_text(root / "templates" / "tasks-template.md"),
}


def _build_workflow_prompt(instruction: str) -> str:
assets = _prompt_assets()

return dedent(
f"""
You are the Specify Spec Kit benchmarking agent. Your job is to apply the
Spec -> Plan -> Tasks workflow exactly as defined by the CLI prompts before
attempting any implementation work in the Terminal Bench task container.

Task instruction from Terminal Bench:
---
{instruction.strip()}
---

## Workflow expectations
1. Review repository context and any constitutions before acting.
2. Produce SPECIFICATION, PLAN, and TASKS sections in that order using the
canonical prompts below. Do not start execution until all three are
drafted.
3. After presenting the tasks list, print `BEGIN EXECUTION` and carry out the
tasks sequentially, announcing each task ID as you start and finish.
4. Keep artefacts up to date as understanding evolves and run relevant tests
before concluding.

## Canonical command prompts
These excerpts are copied directly from the Specify CLI. Use them verbatim when
constructing the SPECIFICATION, PLAN, and TASKS artefacts.

### templates/commands/specify.md
```markdown
{assets['spec_command'].strip()}
```

### templates/commands/plan.md
```markdown
{assets['plan_command'].strip()}
```

### templates/commands/tasks.md
```markdown
{assets['tasks_command'].strip()}
```

## Canonical templates
Reference these structures while drafting the artefacts so they stay aligned
with the Specify CLI outputs.

### templates/spec-template.md
```markdown
{assets['spec_template'].strip()}
```

### templates/plan-template.md
```markdown
{assets['plan_template'].strip()}
```

### templates/tasks-template.md
```markdown
{assets['tasks_template'].strip()}
```

Proceed only after you have completed the SPECIFICATION, PLAN, and TASKS
sections above. Once `BEGIN EXECUTION` has been emitted, follow the plan to
completion or explain any blockers.
"""
).strip()


class SpecifyWorkflowMixin:
"""Override instruction rendering to include Spec Kit workflow guidance."""

def _render_instruction(self, instruction: str) -> str: # type: ignore[override]
return _build_workflow_prompt(instruction)


class SpecifyClaudeWorkflowAgent(SpecifyWorkflowMixin, ClaudeCodeAgent):
"""Claude Code agent preconfigured with the Spec -> Plan -> Tasks workflow."""

@staticmethod
def name() -> str:
return "specify_claude_workflow"

def __init__(self, model_name: str | None = None, *args, **kwargs):
super().__init__(model_name=model_name, *args, **kwargs)


class SpecifyOpenCodeWorkflowAgent(SpecifyWorkflowMixin, OpenCodeAgent):
"""OpenCode agent that drives the Spec -> Plan -> Tasks workflow."""

_DEFAULT_MODEL = "opencode/grok-code-fast-1"

@staticmethod
def name() -> str:
return "specify_opencode_workflow"

def __init__(self, model_name: str | None = None, *args, **kwargs):
super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs)

def _render_instruction(self, instruction: str) -> str: # type: ignore[override]
Copilot AI (Sep 18, 2025): The SpecifyOpenCodeWorkflowAgent overrides _render_instruction to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.

Suggested change:
-    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]
+    def render_instruction(self, instruction: str) -> str:
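
One way to read that suggestion is sketched below with hypothetical names (`SpecifyPromptBuilder`, `ComposedOpenCodeAgent`): inject the prompt strategy instead of inheriting it. That `_render_instruction` is the right hook is an assumption carried over from the Terminal Bench base classes.

```python
# Hypothetical sketch of the composition alternative, not part of this PR.
from terminal_bench.agents.installed_agents.opencode.opencode_agent import (
    OpenCodeAgent,
)

from specify_terminal_bench.agent import _build_workflow_prompt


class SpecifyPromptBuilder:
    """Explicit collaborator that wraps the workflow prompt assembly."""

    def render(self, instruction: str) -> str:
        return _build_workflow_prompt(instruction)


class ComposedOpenCodeAgent(OpenCodeAgent):
    """Receives its prompt strategy instead of inheriting it from a mixin."""

    def __init__(self, *args, prompt_builder: SpecifyPromptBuilder | None = None, **kwargs):
        super().__init__(*args, **kwargs)
        # prompt_builder=None makes the raw-instruction path an explicit
        # choice rather than a silent method override.
        self._prompt_builder = prompt_builder

    def _render_instruction(self, instruction: str) -> str:
        if self._prompt_builder is None:
            return instruction
        return self._prompt_builder.render(instruction)
```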
# OpenCode uses stored prompts via `opencode run --command specify`, so pass
# through the raw task instruction and let the CLI command wrap it.
return instruction

def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]:
Copilot AI (Sep 18, 2025): Missing import for TerminalCommand. This class is used in the return type annotation and instantiated on line 154, but it's not imported at the top of the file.
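
A plausible fix is a top-of-file import; the exact module path below is an assumption about the terminal-bench package layout, not something this diff confirms:

```python
# Assumed import path for TerminalCommand; verify against the installed
# terminal-bench version before relying on it.
from terminal_bench.terminal.models import TerminalCommand
```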
escaped_instruction = shlex.quote(instruction)
return [
TerminalCommand(
command=(
f"opencode --model {self._model_name} -p specify run --command specify {escaped_instruction}"
),
min_timeout_sec=0.0,
max_timeout_sec=float("inf"),
block=True,
append_enter=True,
),
]

@property
def _env(self) -> dict[str, str]: # type: ignore[override]
if getattr(self, "_provider", None) == "opencode":
# OpenCode public models do not require credentials, but allow an
# override if the user exports OPENCODE_API_KEY.
env: dict[str, str] = {}
if "OPENCODE_API_KEY" in os.environ:
env["OPENCODE_API_KEY"] = os.environ["OPENCODE_API_KEY"]
return env
return super()._env
13 changes: 13 additions & 0 deletions (Claude Code setup script)
@@ -0,0 +1,13 @@
#!/bin/bash

apt-get update
apt-get install -y curl

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Comment on lines +6 to +7 (Copilot AI, Sep 18, 2025): Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change:
-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+# Download nvm install script
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh"
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT"
+# Expected SHA256 checksum for nvm v0.40.2 install.sh
+EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum
+ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')"
+if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
+    echo "ERROR: Checksum verification failed for nvm install.sh!"
+    echo "Expected: $EXPECTED_SHA256"
+    echo "Actual: $ACTUAL_SHA256"
+    exit 1
+fi
+bash "$NVM_INSTALL_SCRIPT"
source "$HOME/.nvm/nvm.sh"

nvm install 22
npm -v

npm install -g @anthropic-ai/claude-code@{{ version }}
23 changes: 23 additions & 0 deletions (OpenCode setup script)
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail
Comment on lines +1 to +2 (Copilot AI, Sep 18, 2025): The script uses set -euo pipefail but line 8 pipes curl output to bash, which could mask curl failures. Consider using intermediate error checking or separating the download and execution steps.

apt-get update
apt-get install -y curl git python3 python3-venv

# Install Node ecosystem for OpenCode CLI
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Comment on lines +8 to +9 (Copilot AI, Sep 18, 2025): Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change:
-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SH="/tmp/nvm-install.sh"
+NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH"
+echo "${NVM_INSTALL_SH_SHA256} $NVM_INSTALL_SH" | sha256sum -c -
+bash "$NVM_INSTALL_SH"
source "$HOME/.nvm/nvm.sh"

nvm install 22
npm -v

npm i -g opencode-ai@{{ version }}

# Install uv for Specify CLI bootstrap
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Bootstrap Specify prompts inside the task repository
cd /app
uvx --from git+https://github.com/github/spec-kit.git specify init --no-git --ai opencode --script sh --ignore-agent-tools task-specification
49 changes: 49 additions & 0 deletions (Claude workflow prompt template)
@@ -0,0 +1,49 @@
{# Specify workflow template used by SpecifyClaudeWorkflowAgent #}
You are the Specify CLI benchmarking agent. You are being evaluated on your ability to run
a disciplined specification-driven workflow inside a terminal-only environment.

Task instruction from the benchmark:
{{ instruction }}

Always follow the Spec -> Plan -> Tasks (SPT) workflow *before* writing or editing
production code. Keep output concise and actionable for terminal use.

Checklist before you begin coding:
1. Inspect repository metadata (README, docs/, tests/) to understand context.
2. If a constitution or non-negotiable guidelines file exists (COMMON PATHS:
`CONSTITUTION.md`, `/memory/constitution.md`, `.specify/constitution.md`), read it
and obey it throughout the session.
3. Capture any missing information as questions rather than assumptions.

Deliver the following artefacts in your first response *in this order*:

SPECIFICATION
- Summarise the desired behaviour and the user impact in plain language.
- Highlight scope boundaries and success criteria.
- Record open questions as bullet points prefixed with `NEEDS CLARIFICATION:`.

PLAN
- Produce an ordered implementation strategy (5-10 steps).
- Note which files you expect to touch and why.
- Identify validation steps (tests, linters, sanity checks).

TASKS
- Emit a numbered task list (T001, T002, ...) in dependency order.
- Tag tasks that can run in parallel with `[P]`.
- Each task must describe the concrete change plus the command(s) you will run.

After you print the task list, explicitly write `BEGIN EXECUTION` and then carry out
the tasks sequentially. While executing:
- Announce the task ID when starting or finishing a task.
- Keep artefacts up to date (update spec/plan/tasks sections in your messages when the
understanding changes).
- Prefer small, reviewable commits; run repository tests when they exist.
- Exit paging programs (`less`, editors) immediately after retrieving the needed output.

Completion requirements:
- All tasks have been executed or intentionally skipped with justification.
- Tests relevant to the change have been run (and rerun after fixes).
- Final message contains: summary of changes, test evidence, any follow-up work.

If at any point requirements conflict with the constitution or repository tests fail,
stop progressing and explain what blocks you.