There is a bit of a growth industry in (pre-)training data preparation for LLM development. This page aims to offer navigational help in the dataset landscape, essentially providing a structured ‘catalogue’ of available resources. The catalogue was originally constructed for internal use in the OpenEuroLLM initiative, i.e. it puts most emphasis on datasets used in the project. At the same time, we hope that this overview may become useful to others and can grow into a community-supported resource. The catalogue is accompanied by a curated collection of (a subset of) LLM (pre-)training datasets that are made publicly available (read-only) on multiple EuroHPC systems, currently:
- LUMI: `/appl/local/openeurollm/training/catalogue/`
To nominate additional resources for inclusion in the catalogue or to discuss specifics of emerging entries, please open a GitHub issue in this repository.
- C4 (Raffel et al., 2019)
- The Pile (Gao et al., 2020; Biderman et al., 2022)
- RefinedWeb
- RedPajama
- SlimPajama
- Dolma
- FineWeb 1 (Penedo et al., 2024)
- DCLM (Li et al., 2024)
- Nemotron-CC (Su et al., 2024)
- Common Pile (Kandpal et al., 2025)
- mC4
- CulturaX (Nguyen et al., 2024)
- HPLT (de Gibert et al., 2024; Burchell et al., 2025)
- FineWeb 2 (https://github.com/huggingface/fineweb-2)
- MADLAD-400 (Kudugunta et al., 2023)
- TxT360 (https://huggingface.co/datasets/LLM360/TxT360)
- Common Corpus (https://huggingface.co/blog/Pclanglais/common-corpus)
- StarCoder
- The Stack
- Proof-Pile-2 (Azerbayev et al., 2023)
- MegaMath
- OpenThoughts
The catalogue standardizes on the common JSON Lines format, where each document is encoded as a JSON object comprising the document text and available metadata (as defined by each distinct resource), and each JSON object is serialized as a single line, i.e. without internal line breaks. For storage efficiency, all files are compressed with Zstandard (zstd).
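As a minimal illustration of this layout, the sketch below streams documents from a compressed file. The file name `example.jsonl.zst` and the `text` field are placeholders, since each resource defines its own metadata schema.

```python
import io
import json

import zstandard  # pip install zstandard


def iter_documents(path):
    """Stream JSON documents from a Zstandard-compressed JSON Lines file."""
    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)  # one document (JSON object) per line


# Hypothetical file and field names; each resource defines its own metadata.
for doc in iter_documents("example.jsonl.zst"):
    print(sorted(doc.keys()))          # available metadata fields
    print(doc.get("text", "")[:200])   # first 200 characters of the text
    break
```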
The project standardizes on three-letter language codes from ISO 639-3 and script (or writing system) codes from ISO 15924, e.g. ces_Latn for Czech in Latin script or srp_Cyrl for Serbian in Cyrillic script.
As of April 2025, the project targets 36 distinct languages, with 42 internal variants, e.g. different scripts or written standards.
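For scripting against such labels, a minimal sketch of splitting a language–script label into its two parts is shown below; it only checks the shape of the label, not membership in the actual ISO 639-3 or ISO 15924 code tables.

```python
import re

# Shape of the language–script labels: a lowercase ISO 639-3 code, an
# underscore, and a title-case ISO 15924 script code, e.g. "ces_Latn".
LABEL = re.compile(r"(?P<lang>[a-z]{3})_(?P<script>[A-Z][a-z]{3})")


def split_label(label: str) -> tuple[str, str]:
    """Split a label such as 'srp_Cyrl' into (language, script)."""
    match = LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not an <iso639-3>_<iso15924> label: {label!r}")
    return match.group("lang"), match.group("script")


print(split_label("ces_Latn"))  # ('ces', 'Latn')
print(split_label("srp_Cyrl"))  # ('srp', 'Cyrl')
```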
To obtain reasonably comparable statistics across different resources, the following metrics are defined:
- bytes: on-disk size in native format, e.g. compressed JSON or Parquet
- documents: number of documents, e.g. web pages, papers, books, or similar
- segments: number of paragraph-like units (e.g. `<h1>`, `<p>`, `<li>`, `<pre>` in HTML)
- characters: total volume in Unicode characters (including whitespace)
- tokens: sub-word units according to a common tokenizer (currently Gemma 3)
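The sketch below illustrates how such per-document statistics could be approximated, assuming the document text sits in a `text` field and that a Gemma 3 tokenizer is loaded through Hugging Face transformers (the exact model identifier is an assumption). The bytes metric is simply the on-disk size of the compressed file and is not computed per document.

```python
import os

from transformers import AutoTokenizer  # pip install transformers

# Assumed model identifier; only its tokenizer is needed for the token count.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")


def document_stats(doc: dict) -> dict:
    """Approximate per-document counts for the catalogue metrics."""
    text = doc.get("text", "")  # field name is an assumption
    return {
        "documents": 1,
        # Crude stand-in for paragraph-like units: blank-line-separated blocks.
        "segments": sum(1 for block in text.split("\n\n") if block.strip()),
        "characters": len(text),
        "tokens": len(tokenizer.encode(text, add_special_tokens=False)),
    }


def file_bytes(path: str) -> int:
    """On-disk size of the (compressed) file, matching the bytes metric."""
    return os.path.getsize(path)
```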
Common pre-training datasets typically comprise large components of text derived from web data, e.g. from the Common Crawl, the Internet Archive, or other initiatives, and sometimes also include non-web data, e.g. (out-of-copyright) books, government publications, or scientific literature. For various sources, text can be derived from different publishing formats, e.g. HTML documents or PDF files extracted from web crawls. It would be hard to devise a formal ontology that fully describes different data sources. Instead, catalogue entries provide a free-text, high-level indication of salient information, e.g. the origin of underlying web crawls and other data sources, original document formats, and the like.