ACE (Active learning for Capability Evaluation) is a novel framework that uses active learning and powerful language models to automate fine-grained evaluation of foundation models. It enables scalable, adaptive testing that uncovers strengths and weaknesses beyond static benchmarks.
The development environment is managed with Poetry. Make sure it is installed, then run:

```bash
python3 -m poetry install
source $(poetry env info --path)/bin/activate
```

To install the dependencies for testing (code style, unit tests, integration tests), run:

```bash
python3 -m poetry install --with test
```

The capability evaluation logs (evaluated using Inspect) are stored in a GCP bucket. Log in with your GCP account using:

```bash
gcloud auth application-default login
```

Set the following environment variables:
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY` - to use LLMs provided by Google
- `ANTHROPIC_API_KEY` - to use LLMs provided by Anthropic
- Rate limit vars (default values given):
  - `RATE_LIMIT_CALLS=5`
  - `RATE_LIMIT_PERIOD=60`
- LangSmith tracing vars:
  - `LANGSMITH_TRACING=true`
  - `LANGSMITH_ENDPOINT="https://api.smith.langchain.com"`
  - `LANGSMITH_API_KEY=<langsmith_api_key>`
  - `LANGSMITH_PROJECT="automated_capability_evaluation"`
- GCP env vars:
  - `GOOGLE_CLOUD_PROJECT=<project_id>`
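For example, these can be exported in your shell before running the pipeline (all values below are placeholders):

```bash
# Placeholder values; substitute your own keys and project ID.
export OPENAI_API_KEY=<openai_api_key>
export GOOGLE_API_KEY=<google_api_key>
export ANTHROPIC_API_KEY=<anthropic_api_key>
export RATE_LIMIT_CALLS=5
export RATE_LIMIT_PERIOD=60
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_API_KEY=<langsmith_api_key>
export LANGSMITH_PROJECT="automated_capability_evaluation"
export GOOGLE_CLOUD_PROJECT=<project_id>
```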
Modify `src/cfg/run_cfg.yaml`, if required.
Generates capability names and descriptions in the first step. In the second step, for each capability, it generates tasks, solves them, and verifies the solutions:

```bash
python -m src.run_capability_generation
```

Evaluates the subject LLM on the generated capabilities and calculates a score for each:

```bash
python -m src.run_evaluation
```

Uses each capability and the corresponding subject LLM score to select or generate a new capability:

```bash
python -m src.run_lbo
```

The following scripts implement the multi-agent debate workflow for automated generation of areas, capabilities, tasks, and solutions. All configurable parameters are defined in `src/cfg/agentic_config.yaml`.
The pipeline uses auto-generated tags to organize outputs from each step. Understanding how tags work is essential for running the pipeline:
- Tag Format: Tags are automatically generated timestamps in the format `_YYYYMMDD_HHMMSS` (e.g., `_20251104_143022`)
- Auto-Generation: When you run a step (e.g., Generate Areas), the script automatically creates a tag and includes it in the output path
- Finding Tags: After running a step, check the console output or the output directory to see the generated tag. The tag appears in the file path where outputs are saved
- Using Tags: To run the next step in the pipeline, you need to specify the tag from the previous step's output:
  - Step 2 (Generate Capabilities) needs `areas_tag` from Step 1
  - Step 3 (Generate Tasks) needs `capabilities_tag` from Step 2
  - Step 4 (Generate Solutions) needs `tasks_tag` from Step 3
Example Workflow:
- Run `python -m src.agentic_area_generator` → outputs to `.../areas/_20251104_143022/areas.json`
- Use the tag `_20251104_143022` in the next step: `python -m src.agentic_capability_generator pipeline_tags.areas_tag=_20251104_143022`
- The capability generator outputs to `.../capabilities/_20251104_150315/...`
- Use this new tag for the next step, and so on.
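Because tags are timestamps in a fixed format, they sort lexicographically, so the most recent tag in a step's output directory can be picked up from a script. A minimal sketch (the `latest_tag` helper and the base path are hypothetical; adjust them to your configured output directory, domain, and experiment ID):

```bash
# Hypothetical helper: timestamps sort lexicographically, so the newest tag sorts last.
latest_tag() { ls "$1" | sort | tail -n 1; }

# Hypothetical <output_dir>/<domain>/<exp_id> base path.
BASE=~/results/mathematics/exp_001

python -m src.agentic_area_generator
AREAS_TAG=$(latest_tag "$BASE/areas")

python -m src.agentic_capability_generator pipeline_tags.areas_tag="$AREAS_TAG"
CAPABILITIES_TAG=$(latest_tag "$BASE/capabilities")
```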
Generate domain areas using the scientist–moderator debate system:
```bash
python -m src.agentic_area_generator
```

This step auto-generates a tag (e.g., `_20251104_143022`).

Output location:

```
~/<output_dir>/<domain>/<exp_id>/areas/<areas_tag>/areas.json
```

Where:

- `<output_dir>` comes from `global_cfg.output_dir`
- `<domain>` comes from `global_cfg.domain` (spaces replaced with underscores)
- `<exp_id>` comes from `exp_cfg.exp_id`
- `<areas_tag>` is the auto-generated tag for this run (use this tag in Step 2)
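As a concrete illustration of how the template resolves (all config values below are hypothetical), note the space-to-underscore substitution in the domain:

```bash
# Hypothetical config values.
OUTPUT_DIR=results; DOMAIN="number theory"; EXP_ID=exp_001; AREAS_TAG=_20251104_143022

# Spaces in the domain are replaced with underscores in the path.
echo ~/"$OUTPUT_DIR/${DOMAIN// /_}/$EXP_ID/areas/$AREAS_TAG/areas.json"
# -> /home/<user>/results/number_theory/exp_001/areas/_20251104_143022/areas.json
```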
Generate capabilities for each area:
```bash
# Use the areas_tag from Step 1 (Generate Areas) output
python -m src.agentic_capability_generator pipeline_tags.areas_tag=_YYYYMMDD_HHMMSS pipeline_tags.resume_capabilities_tag=_YYYYMMDD_HHMMSS
```

Options:

- `pipeline_tags.areas_tag` specifies which set of areas to use when generating capabilities. This should be the `<areas_tag>` from the output of Step 1 (Generate Areas).
- `pipeline_tags.resume_capabilities_tag` (optional) resumes a previous capability generation run.
This step auto-generates a new tag for the capabilities output.
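For example, continuing from the example workflow above (both tag values are hypothetical):

```bash
# Generate capabilities from the Step 1 areas; resume an earlier run with that run's tag.
python -m src.agentic_capability_generator \
  pipeline_tags.areas_tag=_20251104_143022 \
  pipeline_tags.resume_capabilities_tag=_20251104_150315
```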
Output location:

```
~/<output_dir>/<domain>/<exp_id>/capabilities/<capabilities_tag>/<area>/capabilities.json
```

Where:

- `<capabilities_tag>` is the auto-generated tag for this run (use this tag in Step 3)
Generate evaluation tasks for a specific capabilities tag:
```bash
# Use the capabilities_tag from Step 2 (Generate Capabilities) output
python -m src.agentic_task_generator pipeline_tags.capabilities_tag=_YYYYMMDD_HHMMSS pipeline_tags.resume_tasks_tag=_YYYYMMDD_HHMMSS
```

Options:

- `pipeline_tags.capabilities_tag` specifies which set of capabilities to use when generating tasks. This should be the `<capabilities_tag>` from the output of Step 2 (Generate Capabilities).
- `pipeline_tags.resume_tasks_tag` (optional) resumes a previous task generation run.
This step auto-generates a new tag for the tasks output.
Output location:

```
~/<output_dir>/<domain>/<exp_id>/tasks/<tasks_tag>/[<area>]-[<capability>]/tasks.json
```

Where:

- `<tasks_tag>` is the auto-generated tag for this run (use this tag in Step 4)
Solve generated tasks using the multi-agent debate system:
```bash
# Use the tasks_tag from Step 3 (Generate Tasks) output
python -m src.agentic_task_solver pipeline_tags.tasks_tag=_YYYYMMDD_HHMMSS pipeline_tags.resume_solutions_tag=_YYYYMMDD_HHMMSS
```

Options:

- `pipeline_tags.tasks_tag` specifies which set of tasks to solve. This should be the `<tasks_tag>` from the output of Step 3 (Generate Tasks).
- `pipeline_tags.resume_solutions_tag` (optional) resumes a previous solution generation run.
This step auto-generates a new tag for the solutions output.
Output location:

```
~/<output_dir>/<domain>/<exp_id>/task_solutions/<solutions_tag>/[<area>]-[<capability>]/<task_id>_solution.json
```

Where:

- `<solutions_tag>` is the auto-generated tag for this run
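Putting the four steps together, a full agentic run looks like the sketch below. All tag values are hypothetical; in practice, copy each tag from the previous step's output path:

```bash
python -m src.agentic_area_generator
# -> .../areas/_20251104_143022/areas.json

python -m src.agentic_capability_generator pipeline_tags.areas_tag=_20251104_143022
# -> .../capabilities/_20251104_150315/...

python -m src.agentic_task_generator pipeline_tags.capabilities_tag=_20251104_150315
# -> .../tasks/_20251104_153210/...

python -m src.agentic_task_solver pipeline_tags.tasks_tag=_20251104_153210
# -> .../task_solutions/_20251104_160447/...
```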
Tools for extracting, processing, and matching mathematical capabilities from Wikipedia. All prompts are centralized in `wikipedia/prompts.py`.
Scrapes Wikipedia's "Glossary of areas of mathematics", extracts capability descriptions, and generates summaries with LLM-powered categorization.
```bash
cd wikipedia
python wikipedia_scraper.py
```

Outputs JSON files to `wikipedia/pages/` containing `capability_name`, `description`, `summary`, `area`, `url`, and `timestamp`.
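To spot-check the scraped records from the shell, something like the following works, assuming `jq` is installed and each output file holds a single JSON object (the field names come from the list above):

```bash
# Print selected fields from every scraped page.
jq '{capability_name, area, url}' wikipedia/pages/*.json
```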
Matches Wikipedia capabilities with generated capabilities using LLM-based similarity analysis. Supports bidirectional matching.
Configure `wikipedia/cfg/wiki_vs_generated.yaml`:

- `data_cfg.wikipedia_pages_dir`: Wikipedia pages directory
- `data_cfg.generated_dir`: generated capabilities directory
- `processing_cfg.match_direction`: `generated_to_wikipedia` or `wikipedia_to_generated`
```bash
cd wikipedia
python wiki_vs_generated.py
```

Categorizes questions from the GSM8K or MATH datasets into mathematical areas using generated or Wikipedia taxonomies. Supports checkpoint-based resume.
Configure `wikipedia/cfg/static_vs_generated.yaml`:

- `data_cfg.dataset_name`: `gsm8k` or `math`
- `data_cfg.dataset_path`: dataset file (GSM8K) or directory (MATH)
- `categorization_cfg.extraction_method`: `generated` or `wikipedia`
```bash
cd wikipedia
python static_vs_generated.py
```