A comprehensive benchmarking framework for evaluating Large Language Models (LLMs) as autonomous agents in MiniGrid environments using SARSA (State-Action-Reward-State-Action) reinforcement learning principles.
This project implements a sophisticated evaluation framework that tests LLMs' ability to navigate and solve grid-world environments. Unlike traditional implementations, our approach uses true SARSA methodology by providing agents with visual history of previous states, enabling better spatial reasoning and trajectory understanding.
- SARSA-Based Learning: Full state-action-reward-state-action implementation with visual state history
- Rich Visualizations: Automatic trajectory GIF generation with embedded metadata
- Multi-Model Support: Compatible with OpenAI (GPT-4o, GPT-5) and Google (Gemini) models
- Comprehensive Logging: Detailed step-by-step tracking with replay buffer visualization
- Text Overlays: Model name, step number, and buffer size burned into images
- Multiple Environments: Support for various MiniGrid environments (Empty, LavaCrossing, SimpleCrossing, etc.)
```bash
# Clone the repository
git clone <repository-url>
cd minigrid-bench

# Install dependencies
pip install gymnasium minigrid pillow openai google-generativeai numpy

# Set up API keys
export OPENAI_API_KEY="your-openai-api-key"
export GEMINI_API_KEY="your-gemini-api-key"
```

```bash
# Run a simple experiment with GPT-4o
python bench.py --env_id MiniGrid-Empty-5x5-v0 --provider openai --model gpt-4o --episodes 1 --max_steps 20 --replay_len 3

# Run with Gemini Pro
python bench.py --env_id MiniGrid-LavaCrossingS9N1-v0 --provider gemini --model gemini-2.5-pro --episodes 1 --max_steps 50 --replay_len 5
```

Our implementation follows true SARSA methodology:
- State Representation: Visual observations as PNG images
- Action Space: Discrete actions (left, right, forward, pickup, drop, toggle, done)
- Reward Signal: Environment-provided rewards
- State History: Previous states stored in replay buffer
- Action Selection: LLM-based policy with visual context
```
Current State Image →
LLM (with Replay History) →
Action Selection →
Environment Step →
Reward + Next State →
Update Replay Buffer
```
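The loop above can be sketched in Python. This is a minimal illustration under assumed names (`ReplayBuffer` and `run_episode` are hypothetical, not the actual `bench.py` API):

```python
from collections import deque


class ReplayBuffer:
    """Keeps the last `replay_len` states for SARSA-style visual context."""

    def __init__(self, replay_len: int = 3):
        self.states = deque(maxlen=replay_len)

    def add(self, state):
        self.states.append(state)

    def history(self):
        return list(self.states)


def run_episode(env_step, select_action, buffer, first_state, max_steps=20):
    """Generic agent loop: state -> LLM action -> env step -> buffer update.

    env_step(action) is assumed to return (next_state, reward, done);
    select_action(state, history) stands in for the LLM policy call.
    """
    state, total_reward = first_state, 0.0
    for _ in range(max_steps):
        action = select_action(state, buffer.history())
        state, reward, done = env_step(action)
        total_reward += reward
        buffer.add(state)  # next state becomes part of the visual history
        if done:
            break
    return total_reward
```

With `replay_len=1` the agent sees only the previous state; larger values give the model a longer visual trajectory to reason over.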
```
minigrid-bench/
├── bench.py              # Main benchmarking script
├── utils.py              # Image processing and utilities
├── experiments/          # Generated experiment data
│   └── timestamp_env_model_id/
│       ├── manifest.json       # Experiment metadata
│       ├── step_*.png          # Individual step images
│       └── trajectory.gif      # Complete trajectory visualization
└── README.md             # This file
```
| Argument | Default | Description |
|---|---|---|
| `--env_id` | `MiniGrid-Empty-5x5-v0` | MiniGrid environment ID |
| `--provider` | `openai` | LLM provider (`openai` or `gemini`) |
| `--model` | `gpt-4o` | Model name |
| `--replay_len` | `1` | Replay buffer size (SARSA history length) |
| `--episodes` | `1` | Number of episodes to run |
| `--max_steps` | `128` | Maximum steps per episode |
| `--seed` | `0` | Random seed for reproducibility |
| `--top_p` | `1.0` | Nucleus sampling parameter |
| `--experiments_dir` | `experiments` | Output directory for results |
- MiniGrid-Empty-5x5-v0: Simple navigation
- MiniGrid-LavaCrossingS9N1-v0: Obstacle avoidance
- MiniGrid-SimpleCrossingS11N5-v0: Basic crossing task
- And many more MiniGrid environments!
OpenAI:
- gpt-4o
- gpt-5
- gpt-5-mini
Google:
- gemini-2.5-pro
- gemini-2.5-flash
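Since model names encode their provider, batch scripts can derive the `--provider` flag instead of passing it explicitly. A hypothetical helper (not part of `bench.py`):

```python
def infer_provider(model: str) -> str:
    # Hypothetical convenience: map a model name to its provider flag.
    if model.startswith("gemini"):
        return "gemini"
    if model.startswith("gpt"):
        return "openai"
    raise ValueError(f"Unknown model family: {model}")
```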
Each experiment generates:
- Individual Step Images (`step_XXX_action_taken_ACTION.png`):
  - Embedded metadata (model, step, buffer size)
  - 8x upscaled for visibility
  - Clear action labeling
- Trajectory GIF (`trajectory.gif`):
  - Complete episode visualization
  - 400 ms per frame
  - Preserves all metadata overlays
- Manifest JSON (`manifest.json`):
  - Experiment configuration
  - Model and environment details
  - Timestamp and unique ID
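For reference, a generated manifest might look like this (field names are illustrative; consult an actual `manifest.json` for the exact schema):

```json
{
  "experiment_id": "20250101T120000_MiniGrid-Empty-5x5-v0_gpt-4o_ab12cd",
  "env_id": "MiniGrid-Empty-5x5-v0",
  "provider": "openai",
  "model": "gpt-4o",
  "replay_len": 3,
  "episodes": 1,
  "max_steps": 20,
  "seed": 0,
  "timestamp": "2025-01-01T12:00:00Z"
}
```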
This framework is designed for:
- LLM Capability Assessment: How well do different models perform spatial reasoning?
- SARSA Learning Analysis: Does visual history improve decision making?
- Cross-Model Comparison: Systematic evaluation across providers
- Trajectory Analysis: Understanding agent behavior patterns
- Failure Mode Investigation: Identifying where agents get stuck
```bash
# Add support for new MiniGrid environments
python bench.py --env_id MiniGrid-YourCustomEnv-v0 --provider openai --model gpt-4o
```

```bash
# Run multiple models on the same environment
for model in gpt-4o gpt-5 gemini-2.5-pro; do
  case "$model" in gemini-*) provider=gemini ;; *) provider=openai ;; esac
  python bench.py --env_id MiniGrid-LavaCrossingS9N1-v0 --provider "$provider" --model "$model" --episodes 5
done
```

```python
import json
import glob

# Load all experiments
experiments = []
for manifest_path in glob.glob("experiments/*/manifest.json"):
    with open(manifest_path) as f:
        experiments.append(json.load(f))

# Analyze success rates by model
# ... your analysis code here
```

Contributions welcome! Areas of interest:
- New environment integrations
- Additional LLM providers
- Analysis and visualization tools
- Performance optimizations
- Documentation improvements
[License information to be added]
- MiniGrid: The underlying grid-world environment
- SARSA Algorithm: The reinforcement learning approach we implement
- LLM Agents: Recent advances in LLM-based autonomous agents
For questions, issues, or collaboration opportunities, please open an issue in this repository.

