This repository contains ZeroEntropy's evaluation suite for benchmarking AI models across retrieval and reranking tasks. The system supports multiple datasets, embedding methods, and reranking models, and computes NDCG, Recall, and related metrics.
The evaluation pipeline consists of four main stages:
- Data Ingestion - Load and preprocess datasets from various sources
- Embedding Generation - Create vector embeddings using different retrieval methods
- Reranking - Apply reranking models to improve retrieval results
- Metrics Calculation - Compute NDCG, Recall, and other evaluation metrics
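For orientation, these stages can also be run end to end through run_pipeline, which (judging from the custom pipeline example further down in this README) takes a start stage name and an end stage name; a minimal sketch:

import asyncio
from evals.run_pipeline import run_pipeline

# Run every stage, from dataset ingestion through NDCG calculation.
# The ("ingestors", "ndcg") stage names mirror the custom pipeline example below.
asyncio.run(run_pipeline("ingestors", "ndcg"))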
- uv
- CUDA-capable GPU (optional, for local model inference)
# Clone the repository
git clone <repository-url>
cd evals
# Install dependencies using uv (recommended) or pip
uv sync
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys and configuration
# ZeroEntropy
ZEROENTROPY_API_KEY=your_zeroentropy_key
# OpenAI
OPENAI_API_KEY=your_openai_key
# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_key
# Cohere (optional)
COHERE_API_KEY=your_cohere_key
# VoyageAI (optional)
VOYAGEAI_API_KEY=your_voyage_key
# Jina AI (optional)
JINA_API_KEY=your_jina_key
# Modal (optional, for custom models)
MODAL_KEY=your_modal_key
MODAL_SECRET=your_modal_secret
# Baseten (optional)
BASETEN_API_KEY=your_baseten_key
# Together AI (optional)
TOGETHER_API_KEY=your_together_key
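Only the providers you actually use need keys. To fail fast on missing credentials before a long run, a small check along these lines can help (a sketch using plain os.environ; ZEROENTROPY_API_KEY and OPENAI_API_KEY are treated as required here only because they are not marked optional above):

import os

# Keys not marked "(optional)" in the list above.
REQUIRED_KEYS = ["ZEROENTROPY_API_KEY", "OPENAI_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")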
# Run the full evaluation pipeline (all stages)
python evals/run_pipeline.py
# Ingest all default datasets
python evals/run_ingestors.py
# Ingest with custom parameters
python -c "
from evals.run_ingestors import run_ingestors
from evals.types import ALL_INGESTORS
run_ingestors(ingestors=ALL_INGESTORS[:5], max_queries=50)
"
# Generate embeddings with default settings
python evals/run_embeddings.py
# Generate embeddings with custom retrieval method
python -c "
import asyncio
from evals.run_embeddings import run_embeddings
asyncio.run(run_embeddings(retrieval_method='bm25'))
"
# Run reranking with default models
python evals/run_rerankers.py
# Run with specific rerankers
python -c "
import asyncio
from evals.run_rerankers import run_rerankers
asyncio.run(run_rerankers(rerankers=['cohere', 'zeroentropy-large']))
"
# Calculate NDCG metrics
python evals/run_ndcg.py
# Calculate Recall metrics
python evals/run_recall.py
- openai_small - OpenAI text-embedding-3-small (default)
- qwen3_4b - Qwen3 4B embedding model
- qwen3_0.6b - Qwen3 0.6B embedding model
- bm25 - BM25 keyword-based retrieval
- hybrid - Combination of embedding and BM25 methods
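Any of these identifiers can be passed as the retrieval_method argument to run_embeddings, using the same call pattern as the embedding-generation example above; for instance:

import asyncio
from evals.run_embeddings import run_embeddings

# Use hybrid retrieval (embeddings + BM25); swap in any identifier from the list above.
asyncio.run(run_embeddings(retrieval_method="hybrid"))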
- ZeroEntropy Models: zeroentropy-large, zeroentropy-small
- Commercial APIs: cohere, jina, voyageai
- Open Source: mixedbread, qwen, salesforce
- Embedding-based: openai-large-embedding
- FiQA - Financial question answering
- BioASQ - Biomedical questions
- StackOverflow QA - Programming questions
- MS MARCO - Web search queries
- Financial Benchmarks - FinQABench, FinanceBench
- Code Datasets - MBPP, CosQA
- Legal Datasets - Various legal document retrieval tasks
- Multilingual: TwitterHjerneRetrieval, MLQARetrieval, WikipediaRetrievalMultilingual
- English: ArguAna, SCIDOCS, TRECCOVID, WinoGrande
- Specialized: LEMBPasskeyRetrieval, TempReasonL1, SpartQA
- Programming: LeetCode (Python, Java, JavaScript, C++)
- Q&A: Quora, Quora Swedish
- Documents: Meeting transcripts, NarrativeQA, Pandas documentation
from evals.run_pipeline import run_pipeline
from evals.types import MTEB_INGESTORS
import asyncio
# Run pipeline with MTEB datasets only
INGESTORS = MTEB_INGESTORS[:5] # First 5 MTEB datasets
RETRIEVAL_METHOD = "hybrid"
RERANKERS = ["zeroentropy-large", "cohere"]
async def custom_run():
    await run_pipeline("ingestors", "ndcg")
asyncio.run(custom_run())
from evals.ingestors.common import BaseIngestor
from evals.common import Document, Query, QRel

class CustomIngestor(BaseIngestor):
    def dataset_id(self) -> str:
        return "custom/my_dataset"

    def ingest(self) -> tuple[list[Query], list[Document], list[QRel]]:
        # Load your data
        queries = [Query(id="q1", query="What is AI?")]
        documents = [Document(id="d1", content="AI is artificial intelligence")]
        qrels = [QRel(query_id="q1", document_id="d1", score=1.0)]
        return queries, documents, qrels
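The custom ingestor can then be passed straight to run_ingestors, in the same way the train-split example below does (whether it must also be registered in evals/types.py for the full pipeline is not covered here):

from evals.run_ingestors import run_ingestors

# Ingest only the custom dataset defined above.
run_ingestors(ingestors=[CustomIngestor()])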
To run evaluations on the training split of datasets instead of test/validation, you need to modify the dataset configuration. Many ingestors have a split parameter:
from evals.ingestors.master_mteb_ingestion import MasterMtebIngestor

# Example: run on the train split
train_ingestor = MasterMtebIngestor(
    task_name="TwitterHjerneRetrieval",
    dataset_name="twitterhjerneretrieval",
    language="dan-Latn",
    split="train",  # Use "train" instead of the default test/validation split
)
# Use in pipeline
from evals.run_ingestors import run_ingestors
run_ingestors(ingestors=[train_ingestor])
For custom datasets, modify the split parameter in the ingestor configuration in evals/types.py.
Results are stored in {ROOT}/data/datasets/ with the following structure:
data/datasets/
├── {dataset_id}/
│   ├── queries.jsonl                        # Processed queries
│   ├── documents.jsonl                      # Processed documents
│   ├── qrels.jsonl                          # Relevance judgments
│   └── {retrieval_method}/
│       └── {merged|unmerged}/
│           ├── ze_results.jsonl             # Initial retrieval results
│           ├── embeddings_cache.db          # Embedding cache
│           └── {reranker}/
│               └── latest_ze_results.jsonl  # Reranked results
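Since everything is stored as JSONL, intermediate outputs are easy to inspect by hand. A small sketch (the dataset id "my_dataset" is hypothetical, the {ROOT} prefix is assumed to be the current working directory, and the exact record schema is not documented here, so the snippet just loads raw JSON objects):

import json
from pathlib import Path

# Hypothetical dataset id; substitute one of the ingested datasets.
qrels_path = Path("data/datasets") / "my_dataset" / "qrels.jsonl"

with qrels_path.open() as f:
    qrels = [json.loads(line) for line in f]  # one relevance judgment per line

print(f"Loaded {len(qrels)} relevance judgments")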
- Caching: Embeddings are cached automatically to speed up reruns
- Parallel Processing: Reranking supports concurrent processing
- Memory Management: Large datasets are processed in batches
- Rate Limiting: Built-in rate limiting for all API providers
- GPU Usage: Local models automatically use CUDA if available
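The GPU selection mentioned in the last point usually boils down to a standard torch availability check; a sketch of the pattern (the actual logic lives in evals/ai.py and may differ):

import torch

# Prefer CUDA when a GPU is visible, otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running local models on {DEVICE}")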
NDCG (Normalized Discounted Cumulative Gain) measures ranking quality with position-based discounting:
python evals/run_ndcg.py
Recall measures the fraction of relevant documents retrieved in the top K results:
python evals/run_recall.py
Both metrics support:
- Per-dataset breakdowns
- Statistical significance testing (standard error calculation)
- Comparison across retrieval methods and rerankers
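For reference, here is a minimal, self-contained sketch of both metrics on a single ranked list; the repository's own implementations in run_ndcg.py and run_recall.py may differ in details such as the gain function and tie handling:

import math

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    # DCG: graded relevance discounted by log2 of (1-indexed position + 1).
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    # IDCG: the best achievable DCG given the relevance judgments.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)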
evals/
├── ai.py              # AI model interfaces and utilities
├── common.py          # Core data types and configurations
├── types.py           # Type definitions and defaults
├── utils.py           # Utility functions
├── run_*.py           # Main execution scripts
└── ingestors/         # Dataset-specific ingestion logic
    ├── common.py      # Base ingestor class
    ├── fiqa.py        # Example dataset ingestor
    └── ...            # Other dataset ingestors
New embedding model (used for embedding-based reranking): add to ALL_RERANKERS in evals/types.py:
"my-embedding": AIEmbeddingModel(company="my_company", model="my-model"),
New reranker: add to ALL_RERANKERS in evals/types.py:
"my-reranker": AIRerankModel(company="my_company", model="my-model"),
# Run linting
./lint.sh
# Test individual components
python evals/run_ingestors.py
python -c "import asyncio; from evals.run_embeddings import run_embeddings; asyncio.run(run_embeddings())"
- API Rate Limits: The system includes automatic rate limiting, but you may need to adjust the limits in evals/ai.py
- Memory Issues: Reduce batch sizes or dataset sizes:
  run_ingestors(max_queries=50)  # Limit to 50 queries per dataset
- Missing Dependencies: Ensure all optional dependencies are installed:
  pip install torch sentence-transformers rank_bm25
- GPU Issues: Set the device explicitly:
  # In evals/ai.py, modify the DEVICE variable
  DEVICE = "cpu"  # Force CPU usage
- Check the logs in {ROOT}/logs/*
- Review the configuration in evals/types.py
- Examine individual ingestor implementations in evals/ingestors/*
- Ask us on Slack or Discord
This project is developed by the ZeroEntropy AI team. See the license file for usage terms.