Minimal practical implementation of "Eliciting Reasoning in Language Models with Cognitive Tools" (Ebouky et al., 2025) - a modular framework that enhances LLM reasoning through cognitive psychology-inspired tools without requiring additional training.
This framework demonstrates that modular cognitive operations can unlock reasoning capabilities in base language models, achieving near-reasoning-model performance without reinforcement learning:
- GPT-4.1 baseline: 26.7% on AIME 2024
- GPT-4.1 + cognitive tools: 43.3% on AIME 2024
- o1-preview (RL-trained): 44.6% on AIME 2024
The approach closes roughly 93% of the performance gap to the o1-preview reasoning model using only structured prompting.
cognitive-tools-llm/
├── README.md # Complete guide and documentation
├── cognitive_tools.py # Core framework: all tools + orchestrator
├── reproduce_paper.py # Exact experimental reproduction
├── demo.py # Interactive examples (Jupyter-style)
├── evaluate.py # Evaluation framework and metrics
├── benchmarks.json # Benchmark datasets
└── requirements.txt # Dependencies
This minimal structure maximizes information density while ensuring immediate accessibility for both novice learners and experienced researchers.
git clone https://github.com/davidkimai/Cognitive-Tools.git
cd Cognitive-Tools
pip install -r requirements.txt
from cognitive_tools import CognitiveToolsOrchestrator
# Initialize with your LLM interface
orchestrator = CognitiveToolsOrchestrator(llm_interface)
# Solve problems with cognitive tools
result = orchestrator.solve_problem("Find the GCD of 3339, 2961, and 1491.")
print(result) # Uses understand_question, examine_answer, backtracking tools
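The orchestrator only needs an LLM interface that can turn a list of chat messages into a completion. Below is a minimal adapter sketch, assuming an OpenAI-style chat API and a generate(messages) method; the exact interface expected by CognitiveToolsOrchestrator is defined in cognitive_tools.py and may differ.

# Hypothetical adapter sketch -- check cognitive_tools.py for the exact
# method name and signature the orchestrator expects.
from openai import OpenAI

class OpenAIInterface:
    def __init__(self, model="gpt-4.1"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, messages):
        # messages: list of {"role": ..., "content": ...} chat messages
        response = self.client.chat.completions.create(
            model=self.model, messages=messages, temperature=0.0
        )
        return response.choices[0].message.content

llm_interface = OpenAIInterface()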
# Run Jupyter-style demonstrations
python demo.py
# Reproduce paper experiments
python reproduce_paper.py
# Custom evaluation (evaluate_configuration is async, so run it with asyncio)
import asyncio
from evaluate import CognitiveToolsEvaluator

evaluator = CognitiveToolsEvaluator(llm_interfaces)
result = asyncio.run(
    evaluator.evaluate_configuration("model", "benchmark", "cognitive_tools")
)
- understand_question: Problem decomposition and goal management (a prompt-template sketch follows this list)
  - Identifies mathematical concepts and solution approaches
  - Extracts relevant information and constraints
  - Highlights applicable theorems and techniques
- recall_related: Analogical reasoning through examples
  - Retrieves similar solved problems from the model's knowledge
  - Provides step-by-step solution patterns
  - Guides reasoning through structural similarities
- examine_answer: Self-reflection and verification
  - Checks reasoning traces for logical consistency
  - Identifies miscalculations and wrong assumptions
  - Validates solutions against problem constraints
- backtracking: Alternative path exploration
  - Detects flawed reasoning steps
  - Suggests alternative solution approaches
  - Enables systematic exploration of the solution space
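As referenced above, here is a minimal sketch of how understand_question can be realized as a prompt template plus a single LLM call. The wording is paraphrased for illustration and is not the verbatim prompt from the paper or from cognitive_tools.py.

# Illustrative prompt template for the understand_question tool
# (paraphrased; not the exact prompt used in the paper or the repo).
UNDERSTAND_QUESTION_PROMPT = """You are analysing a mathematical problem before it is solved.

Problem:
{question}

1. Identify the core mathematical concepts involved.
2. List the given information and constraints.
3. Name theorems or techniques that may apply.
4. Outline one or two high-level solution approaches.

Do not solve the problem; return only this analysis."""

def understand_question(llm, question):
    # Each tool is a specialized LLM call whose output is fed back
    # into the main reasoning loop as additional context.
    prompt = UNDERSTAND_QUESTION_PROMPT.format(question=question)
    return llm.generate([{"role": "user", "content": prompt}])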
- Modularity: Each tool operates independently to prevent interference
- Flexibility: LLM decides which tools to use and when
- Transparency: Each reasoning step is explicit and traceable
- Composability: Tools can be combined in any sequence
| Model | Baseline | Best Tool | Improvement |
|---|---|---|---|
| Qwen2.5-32B | 79.6% | 84.2% | +4.6% |
| Llama3.3-70B | 52.8% | 79.5% | +26.7% |

| Model | Cognitive Prompting | Cognitive Tools | Advantage |
|---|---|---|---|
| Qwen2.5-32B | 82.0% | 88.0% | +6.0% |
| Llama3.3-70B | 66.0% | 80.0% | +14.0% |

| Model | Benchmark | Baseline | With Tools | Improvement |
|---|---|---|---|---|
| Qwen2.5-32B | AIME 2024 | 17.2% | 32.1% | +14.9% |
| Llama3.3-70B | MATH500 | 57.0% | 74.7% | +17.7% |
The framework implements cognitive architecture principles from Anderson et al. (1997), translating human reasoning patterns into modular LLM operations:
- Goal Management: Systematic problem decomposition
- Memory Retrieval: Analogical reasoning through examples
- Self-Monitoring: Metacognitive reflection and error detection
- Strategy Selection: Flexible exploration of solution paths
This approach provides evidence that base language models possess latent reasoning capabilities that structured cognitive workflows can unlock without requiring reinforcement learning.
# Simplified orchestration logic
def solve_problem(question):
    # The conversation starts with the orchestrator system prompt and the question
    conversation = [system_prompt, question]
    while True:
        response = llm.generate(conversation)
        conversation.append(response)
        # The model ends the loop by emitting a final "ANSWER:" line
        if "ANSWER:" in response:
            return response
        # Otherwise, execute any requested cognitive tools and feed results back
        for call in extract_tool_calls(response):
            tool_result = execute_tool(call)
            conversation.append(tool_result)
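The loop relies on extract_tool_calls to pull tool requests out of the model's response. A minimal sketch, assuming the system prompt asks the model to request tools with lines of the form TOOL: tool_name(arguments); the actual call format is defined in cognitive_tools.py:

import re

# Assumed call format: "TOOL: tool_name(arguments)". The real format is
# whatever convention the orchestrator's system prompt establishes.
TOOL_CALL_PATTERN = re.compile(r"TOOL:\s*(\w+)\((.*?)\)", re.DOTALL)

def extract_tool_calls(response):
    """Return (tool_name, argument_text) pairs found in a model response."""
    return TOOL_CALL_PATTERN.findall(response)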
The orchestrator uses a carefully designed system prompt (an illustrative sketch follows this list) that:
- Describes the available cognitive tools and their functions
- Gives the model flexibility in deciding which tools to use and when
- Enforces structured output with an "ANSWER:" format
- Includes a monetary reward incentive for motivation
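The sketch below conveys the structure of such a system prompt; it is illustrative rather than the exact text shipped in cognitive_tools.py.

# Illustrative system prompt (structure only; not the exact text in the repo).
SYSTEM_PROMPT = """You are solving a mathematical problem and may call the
following cognitive tools whenever they would help:

- understand_question(question): break the problem down before solving it
- recall_related(question): retrieve similar solved problems
- examine_answer(question, reasoning): check the current reasoning for errors
- backtracking(question, reasoning): propose an alternative approach

You decide which tools to use and in what order; you may also answer directly.
To call a tool, write a line of the form: TOOL: tool_name(arguments)
When you are confident in the solution, end with a line starting with "ANSWER:".
A correct, well-justified answer will be rewarded."""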
Each cognitive tool is implemented as three pieces, combined in the sketch after this list:
- Prompt Template: Specialized prompts for specific cognitive operations
- Input Parameters: Question, context, and reasoning state
- Output Processing: Structured responses fed back to main reasoning loop
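Combining those pieces, the execute_tool call from the orchestration loop can be sketched as a lookup into a registry of prompt templates, a templated LLM call, and a wrapper that returns the output as a message for the main conversation. This is an assumption about the internals; in practice the orchestrator would also thread through the question and current reasoning state, which the simplified loop above omits.

# Sketch of execute_tool: prompt template -> LLM call -> structured result
# (illustrative; see cognitive_tools.py for the actual implementation).
TOOL_TEMPLATES = {
    "understand_question": UNDERSTAND_QUESTION_PROMPT,  # defined in the sketch above
    # "recall_related", "examine_answer", "backtracking" follow the same pattern
}

def execute_tool(call, llm, question, reasoning_state=""):
    tool_name, arguments = call
    prompt = TOOL_TEMPLATES[tool_name].format(
        question=question, context=reasoning_state, arguments=arguments
    )
    output = llm.generate([{"role": "user", "content": prompt}])
    # Wrap the tool output so it can be appended to the main conversation
    return f"[{tool_name} result]\n{output}"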
- Pass@1 Accuracy: Percentage of problems solved correctly on the first attempt (a computation sketch follows this list)
- Statistical Significance: Paired t-tests with multiple runs
- Error Analysis: Systematic categorization of reasoning failures
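As referenced above, a minimal sketch of how pass@1 accuracy and its standard error over repeated runs can be computed, assuming each run yields a per-problem list of correctness flags (the actual implementation lives in evaluate.py and may differ):

import statistics

def pass_at_1(correct_flags):
    """Pass@1 for one run: fraction of problems solved on the first attempt."""
    return 100.0 * sum(correct_flags) / len(correct_flags)

def summarize_runs(runs):
    """Mean accuracy and standard error over multiple runs (e.g. num_runs=8)."""
    accuracies = [pass_at_1(run) for run in runs]
    mean = statistics.mean(accuracies)
    std_error = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, std_error

# Hypothetical data: 3 runs over 4 problems each
print(summarize_runs([[1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 0]]))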
- AIME 2024: 30 problems, high-difficulty mathematical reasoning
- MATH500: 500 problems, varied difficulty with complex expressions
- AMC: 83 problems, medium-difficulty competition mathematics
- Smolbenchmark: 50 problems, diverse mathematical tasks (a benchmarks.json loading sketch follows this list)
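As referenced above, a loading sketch for benchmarks.json, written against an assumed schema of the form {"AIME2024": [{"problem": "...", "answer": "..."}, ...]}; inspect the file and evaluate.py for the real field names before relying on it.

import json

# Assumed schema -- treat purely as an illustration.
def load_benchmark(name, path="benchmarks.json"):
    with open(path) as f:
        benchmarks = json.load(f)
    return benchmarks[name]

problems = load_benchmark("AIME2024")
print(f"Loaded {len(problems)} problems")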
# Statistical evaluation with confidence intervals (async call run via asyncio)
result = asyncio.run(evaluator.evaluate_configuration(
    model_name="GPT-4.1",
    benchmark_name="AIME2024",
    configuration="cognitive_tools",
    num_runs=8
))
print(f"Accuracy: {result.accuracy:.1f}% ± {result.std_error:.1f}%")
- Mathematical Reasoning: Enhanced problem-solving on competition mathematics
- Educational Tools: Transparent reasoning for learning applications
- Research Analysis: Systematic exploration of reasoning capabilities
- Domain-Specific Tools: Custom cognitive operations for specialized fields
- Multi-Modal Reasoning: Integration with visual and symbolic reasoning
- Hybrid Approaches: Combination with reinforcement learning methods
- Domain Scope: Primarily evaluated on mathematical reasoning tasks
- Manual Design: Tools are hand-crafted rather than automatically discovered
- Model Dependence: Prompts optimized for specific model families
- Automated Tool Discovery: Machine learning approaches to identify effective cognitive operations
- Broader Domains: Extension to scientific reasoning, logical inference, and planning tasks
- Scale Studies: Evaluation on larger model architectures and parameter counts
@article{ebouky2025cognitive,
title={Eliciting Reasoning in Language Models with Cognitive Tools},
author={Ebouky, Brown and Bartezzaghi, Andrea and Rigotti, Mattia},
journal={arXiv preprint arXiv:2506.12115},
year={2025}
}
MIT License - see LICENSE file for details.
This repository prioritizes accessible minimalism and high information density. Contributions should:
- Maintain the flat, simple repository structure
- Follow the Jupyter-notebook style for new demonstrations
- Include comprehensive documentation within code files
- Provide quantitative validation against paper results
- Ensure accessibility for both novice and expert users
For questions or contributions, please open an issue or pull request.