
MiniGrid Bench: SARSA-Based LLM Agent Evaluation

A comprehensive benchmarking framework for evaluating Large Language Models (LLMs) as autonomous agents in MiniGrid environments using SARSA (State-Action-Reward-State-Action) reinforcement learning principles.

Overview

This project implements an evaluation framework that tests LLMs' ability to navigate and solve grid-world environments. Unlike setups that show the agent only its current observation, our approach follows true SARSA methodology by providing agents with a visual history of previous states, enabling better spatial reasoning and trajectory understanding.

Key Features

  • SARSA-Based Learning: Full state-action-reward-state-action implementation with visual state history
  • Rich Visualizations: Automatic trajectory GIF generation with embedded metadata
  • Multi-Model Support: Compatible with OpenAI (GPT-4o, GPT-5) and Google (Gemini) models
  • Comprehensive Logging: Detailed step-by-step tracking with replay buffer visualization
  • Text Overlays: Model name, step number, and buffer size burned into images
  • Multiple Environments: Support for various MiniGrid environments (Empty, LavaCrossing, SimpleCrossing, etc.)

Quick Start

Installation

```bash
# Clone the repository
git clone <repository-url>
cd minigrid-bench

# Install dependencies
pip install gymnasium minigrid pillow openai google-generativeai numpy

# Set up API keys
export OPENAI_API_KEY="your-openai-api-key"
export GEMINI_API_KEY="your-gemini-api-key"
```

Basic Usage

```bash
# Run a simple experiment with GPT-4o
python bench.py --env_id MiniGrid-Empty-5x5-v0 --provider openai --model gpt-4o --episodes 1 --max_steps 20 --replay_len 3

# Run with Gemini 2.5 Pro
python bench.py --env_id MiniGrid-LavaCrossingS9N1-v0 --provider gemini --model gemini-2.5-pro --episodes 1 --max_steps 50 --replay_len 5
```

Example Results

Simple Navigation (Empty 5x5)

Gemini 2.5 Pro on Empty 5x5

Complex Navigation (Lava Crossing)

Gemini 2.5 Pro on Lava Crossing

Architecture

SARSA Implementation

Our implementation follows true SARSA methodology:

  1. State Representation: Visual observations as PNG images
  2. Action Space: Discrete actions (left, right, forward, pickup, drop, toggle, done)
  3. Reward Signal: Environment-provided rewards
  4. State History: Previous states stored in replay buffer
  5. Action Selection: LLM-based policy with visual context
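
The discrete actions map directly onto MiniGrid's own Actions enum. The sketch below is illustrative only: the Actions import is MiniGrid's real API, while the Transition dataclass is an assumption for exposition, not the structure bench.py actually stores.

```python
# Illustrative sketch: MiniGrid's discrete action enum and a SARSA-style
# transition record. The Transition dataclass is an assumption for exposition,
# not the structure bench.py actually uses.
from dataclasses import dataclass

from minigrid.core.actions import Actions  # left=0, right=1, forward=2, pickup=3, drop=4, toggle=5, done=6


@dataclass
class Transition:
    state_png: bytes       # rendered observation before acting
    action: Actions        # discrete action chosen by the LLM
    reward: float          # environment-provided reward
    next_state_png: bytes  # rendered observation after acting


print([a.name for a in Actions])  # ['left', 'right', 'forward', 'pickup', 'drop', 'toggle', 'done']
```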

Data Flow

```
Current State Image →
    LLM (with Replay History) →
        Action Selection →
            Environment Step →
                Reward + Next State →
                    Update Replay Buffer
```
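
A minimal sketch of this loop, assuming a choose_action helper that stands in for the LLM call (the helper and its signature are hypothetical; bench.py implements the real policy):

```python
# Minimal sketch of the SARSA-style interaction loop (illustrative; not bench.py itself).
from collections import deque

import gymnasium as gym
import minigrid  # noqa: F401  (importing minigrid registers the MiniGrid-* environments)
from PIL import Image


def choose_action(current_frame, history):
    """Hypothetical placeholder for the LLM call; bench.py implements the real policy."""
    return 2  # MiniGrid action 2 = move forward


env = gym.make("MiniGrid-Empty-5x5-v0", render_mode="rgb_array")
obs, info = env.reset(seed=0)
replay = deque(maxlen=3)  # --replay_len frames of visual history

for step in range(20):  # --max_steps
    frame = Image.fromarray(env.render())        # current state as an image
    action = choose_action(frame, list(replay))  # LLM selects an action given the history
    obs, reward, terminated, truncated, info = env.step(action)
    replay.append(frame)                         # update the replay buffer
    if terminated or truncated:
        break

env.close()
```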

File Structure

```
minigrid-bench/
├── bench.py               # Main benchmarking script
├── utils.py               # Image processing and utilities
├── experiments/           # Generated experiment data
│   └── timestamp_env_model_id/
│       ├── manifest.json      # Experiment metadata
│       ├── step_*.png         # Individual step images
│       └── trajectory.gif     # Complete trajectory visualization
└── README.md              # This file
```

Configuration Options

Command Line Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| --env_id | MiniGrid-Empty-5x5-v0 | MiniGrid environment ID |
| --provider | openai | LLM provider (openai or gemini) |
| --model | gpt-4o | Model name |
| --replay_len | 1 | Replay buffer size (SARSA history length) |
| --episodes | 1 | Number of episodes to run |
| --max_steps | 128 | Maximum steps per episode |
| --seed | 0 | Random seed for reproducibility |
| --top_p | 1.0 | Nucleus sampling parameter |
| --experiments_dir | experiments | Output directory for results |
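
For example, all of the options can be combined in a single invocation (the values here are arbitrary illustrations):

```bash
python bench.py \
  --env_id MiniGrid-SimpleCrossingS11N5-v0 \
  --provider openai \
  --model gpt-4o \
  --replay_len 5 \
  --episodes 3 \
  --max_steps 128 \
  --seed 42 \
  --top_p 1.0 \
  --experiments_dir experiments
```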

Supported Environments

  • MiniGrid-Empty-5x5-v0: Simple navigation
  • MiniGrid-LavaCrossingS9N1-v0: Obstacle avoidance
  • MiniGrid-SimpleCrossingS11N5-v0: Basic crossing task
  • And many more MiniGrid environments!
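
To see exactly which MiniGrid IDs your installation registers, you can query Gymnasium's environment registry (a small helper sketch):

```python
# List every MiniGrid environment ID registered with Gymnasium.
import gymnasium as gym
import minigrid  # noqa: F401  (importing minigrid registers the MiniGrid-* environments)

minigrid_ids = sorted(env_id for env_id in gym.registry if env_id.startswith("MiniGrid-"))
print("\n".join(minigrid_ids))
```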

Supported Models

OpenAI:

  • gpt-4o
  • gpt-5
  • gpt-5-mini

Google:

  • gemini-2.5-pro
  • gemini-2.5-flash

Output Format

Each experiment generates:

  1. Individual Step Images: step_XXX_action_taken_ACTION.png
     • Embedded metadata (model, step, buffer size)
     • 8x upscaled for visibility
     • Clear action labeling
  2. Trajectory GIF: trajectory.gif
     • Complete episode visualization
     • 400 ms per frame
     • Preserves all metadata overlays
  3. Manifest JSON: manifest.json
     • Experiment configuration
     • Model and environment details
     • Timestamp and unique ID
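
If you want to reassemble the per-step frames yourself (for example at a frame rate other than the 400 ms default), a minimal Pillow sketch looks like this; the experiment directory name is a placeholder:

```python
# Rebuild a trajectory GIF from the saved step images (illustrative sketch).
import glob

from PIL import Image

frames = [Image.open(p) for p in sorted(glob.glob("experiments/<experiment-dir>/step_*.png"))]
if frames:
    frames[0].save(
        "trajectory_rebuilt.gif",
        save_all=True,
        append_images=frames[1:],
        duration=400,  # milliseconds per frame, matching the default output
        loop=0,
    )
```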

Research Applications

This framework is designed for:

  • LLM Capability Assessment: How well do different models perform spatial reasoning?
  • SARSA Learning Analysis: Does visual history improve decision making?
  • Cross-Model Comparison: Systematic evaluation across providers
  • Trajectory Analysis: Understanding agent behavior patterns
  • Failure Mode Investigation: Identifying where agents get stuck

Advanced Usage

Custom Environments

```bash
# Add support for new MiniGrid environments
python bench.py --env_id MiniGrid-YourCustomEnv-v0 --provider openai --model gpt-4o
```
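
An environment ID can only be created if it is registered in the running Python process. For environments that are not part of the minigrid package, register them first in the same process (for example by adding an import to bench.py); the module path and class name below are placeholders for your own code:

```python
# Register a custom MiniGrid environment so it can be created by ID.
# "your_package.envs:YourCustomEnv" is a placeholder for your own env class.
import gymnasium as gym

gym.register(
    id="MiniGrid-YourCustomEnv-v0",
    entry_point="your_package.envs:YourCustomEnv",
)
```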

Batch Experiments

```bash
# Run multiple models on the same environment (Gemini models need --provider gemini)
for model in gpt-4o gpt-5 gemini-2.5-pro; do
    case "$model" in
        gemini-*) provider=gemini ;;
        *)        provider=openai ;;
    esac
    python bench.py --env_id MiniGrid-LavaCrossingS9N1-v0 --provider "$provider" --model "$model" --episodes 5
done
```

Analysis Scripts

```python
import json
import glob

# Load all experiments
experiments = []
for manifest_path in glob.glob("experiments/*/manifest.json"):
    with open(manifest_path) as f:
        experiments.append(json.load(f))

# Analyze success rates by model
# ... your analysis code here
```
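
For example, assuming each manifest records the model name and a per-episode success flag (the field names model and success are assumptions about manifest.json, not guaranteed), success rates could be aggregated like this:

```python
# Illustrative only: aggregate success rate per model, assuming these manifest fields exist.
import glob
import json
from collections import defaultdict

totals, wins = defaultdict(int), defaultdict(int)
for manifest_path in glob.glob("experiments/*/manifest.json"):
    with open(manifest_path) as f:
        exp = json.load(f)
    model = exp.get("model", "unknown")                   # assumed field
    totals[model] += 1
    wins[model] += int(bool(exp.get("success", False)))   # assumed field

for model in sorted(totals):
    print(f"{model}: {wins[model]}/{totals[model]} episodes succeeded")
```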

Contributing

Contributions welcome! Areas of interest:

  • New environment integrations
  • Additional LLM providers
  • Analysis and visualization tools
  • Performance optimizations
  • Documentation improvements

License

[License information to be added]

Related Work

  • MiniGrid: The underlying grid-world environment
  • SARSA Algorithm: The reinforcement learning approach we implement
  • LLM Agents: Recent advances in LLM-based autonomous agents

Contact

For questions, issues, or collaboration opportunities, please open an issue in this repository.
