GitHub - ReML-AI/IRLBench

IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation

Overview

Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only inputs, rely on multiple-choice formats, and, more importantly, offer little coverage of extremely low-resource languages. To address these gaps, we introduce IRLBench, a benchmark presented in parallel English and Irish, the latter of which UNESCO classifies as definitely endangered. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation not only of correctness but also of language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time, and the best-performing model answers correctly 55.8% of the time in Irish compared to 76.2% in English. We release IRLBench and an accompanying evaluation codebase to enable future research on robust, culturally aware multilingual AI development.

Directory Structure

extract_problems_marking_scheme.py: Pipeline to extract data from PDF images of the Irish Leaving Certificate examination.
generate_response.py: Generates LLM outputs for exam questions.
generate_judgement.py: Evaluates model responses using judge models.
get_results.py: Functions for data loading and results processing.
run_analysis.py: Main script for running analysis and generating visualizations.
visualize_results.py: Functions for creating various visualization types.

Getting Started

Clone this repository
Install the required dependencies (listed in requirements.txt)
Set up your API keys in a .env file (see .env_example for a template)
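
Before running the scripts, it can help to confirm that the keys you expect are actually defined. The following is a minimal, standard-library-only sketch of such a check; it is not part of the repository, and the key names used in it (OPENAI_API_KEY, GEMINI_API_KEY) are hypothetical placeholders. Consult .env_example for the names the scripts really read.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(required, path=".env"):
    """Return required keys that are empty or absent in both the file and os.environ."""
    file_values = load_env_file(path)
    return [k for k in required if not (file_values.get(k) or os.environ.get(k))]


if __name__ == "__main__":
    # Hypothetical key names; replace them with the ones from .env_example.
    required = ["OPENAI_API_KEY", "GEMINI_API_KEY"]
    print("Missing keys:", missing_keys(required))
```

An empty list means every listed key has a non-empty value either in the .env file or in the current environment.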

Usage

Run python extract_problems_marking_scheme.py to run the data collection pipeline. Because the dataset has already been collected and processed, you can also skip this step and load IRLBench directly:

from datasets import load_dataset
ds = load_dataset("ReliableAI/IRLBench")
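
The README does not document the dataset's splits or columns, so a quick way to discover them is to summarize whatever load_dataset returns. The sketch below assumes the call returns a DatasetDict (the usual result when no split is requested); summarize_splits is a hypothetical helper, not a function from this repository, and the guarded part needs the datasets package and network access.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        name: (split.num_rows, list(split.column_names))
        for name, split in dataset_dict.items()
    }


if __name__ == "__main__":
    # Imported here so the helper above can be used without the package installed.
    from datasets import load_dataset

    ds = load_dataset("ReliableAI/IRLBench")
    for name, (rows, columns) in summarize_splits(ds).items():
        print(f"{name}: {rows} rows, columns = {columns}")
```

If the dataset defines several configurations, load_dataset may ask for a configuration name as a second argument; the error message lists the available ones.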

To generate responses from a candidate model:

python generate_response.py --model MODEL_NAME

To generate judgements using an LLM as a judge:

python generate_judgement.py --student_model MODEL_NAME --judge_model JUDGE_MODEL

To run an analysis of the raw results:

python run_analysis.py --model MODEL_NAME --judge_model JUDGE_MODEL --judgements_dir ./judgements --responses_dir ./responses

Where:

MODEL_NAME: The model to be evaluated (e.g., "gemini-2.0-flash", "o4-mini")
JUDGE_MODEL: The model used to judge answers (e.g., "gemini-2.5-flash")
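
The three commands above can be chained into a single run. The sketch below is one hedged way to do that with subprocess; it uses only the script names and flags shown in this README, while build_pipeline and run_pipeline are hypothetical helper names, and the default directories simply mirror the example command.

```python
import subprocess
import sys


def build_pipeline(model, judge_model,
                   responses_dir="./responses", judgements_dir="./judgements"):
    """Return the three README commands, in order, as argument lists."""
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "generate_response.py", "--model", model],
        [py, "generate_judgement.py",
         "--student_model", model, "--judge_model", judge_model],
        [py, "run_analysis.py", "--model", model, "--judge_model", judge_model,
         "--judgements_dir", judgements_dir, "--responses_dir", responses_dir],
    ]


def run_pipeline(model, judge_model):
    """Run each step in sequence, stopping at the first one that fails."""
    for command in build_pipeline(model, judge_model):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    run_pipeline("gemini-2.0-flash", "gemini-2.5-flash")
```

Passing check=True means a failed generation step aborts the run instead of letting the judge model score missing responses.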

Next, add the raw results to the prepare_visualization_data function in visualize_results.py, and then run the visualization:

python visualize_results.py
