There is a bit of a growth industry in (pre-)training data preparation for LLM development. This page aims to offer navigational help in the dataset landscape, essentially providing a structured ‘catalogue’ of available resources. The catalogue was originally constructed for internal use in the OpenEuroLLM initiative, i.e. it puts most emphasis on datasets used in the project. At the same time, we hope that this overview may become useful to others and can grow into a community-supported resource. The catalogue is accompanied by a curated collection of (a subset of) LLM (pre-)training datasets that are made publicly available (read-only) on multiple EuroHPC systems, currently:
- LUMI: `/appl/local/openeurollm/training/catalogue/`
To nominate additional resources for inclusion in the catalogue or to discuss specifics of emerging entries, please open a GitHub issue in this repository.
- C4 (Raffel et al., 2019)
- The Pile (Gao et al., 2020; Biderman et al., 2022)
- RefinedWeb
- RedPajama
- SlimPajama
- Dolma
- FineWeb 1 (Penedo et al., 2024)
- DCLM (Li et al., 2024)
- Nemotron-CC (Su et al., 2024)
- Common Pile (Kandpal et al., 2025)
- mC4
- CulturaX (Nguyen et al., 2024)
- HPLT (de Gibert et al., 2024; Burchell et al., 2025)
- FineWeb 2 (https://github.com/huggingface/fineweb-2)
- MADLAD-400 (Kudugunta et al., 2023)
- TxT360 (https://huggingface.co/datasets/LLM360/TxT360)
- Common Corpus (https://huggingface.co/blog/Pclanglais/common-corpus)
- StarCoder
- The Stack
- Proof-Pile-2 (Azerbayev et al., 2023)
- MegaMath
- OpenThoughts
The catalogue standardizes on the common JSON Lines format, where each document is encoded as a JSON object comprising the document text and available metadata (as defined by each distinct resource), and each JSON object is serialized as a single line, i.e. without internal line breaks. For storage efficiency, all files are compressed with Zstandard (zstd).
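As a minimal illustration of this layout, the sketch below streams documents from a compressed file. The file name `example.jsonl.zst` and the `text` field are placeholders, since each resource defines its own metadata schema.

```python
import io
import json

import zstandard  # pip install zstandard


def iter_documents(path):
    """Stream JSON documents from a Zstandard-compressed JSON Lines file."""
    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)  # one document (JSON object) per line


# Hypothetical file and field names; each resource defines its own metadata.
for doc in iter_documents("example.jsonl.zst"):
    print(sorted(doc.keys()))          # available metadata fields
    print(doc.get("text", "")[:200])   # first 200 characters of the text
    break
```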
The project standardizes on three-letter language codes from ISO 639-3 and script (or writing system) codes from ISO 15924, e.g. ces_Latn for Czech in Latin script or srp_Cyrl for Serbian in Cyrillic script.
As of April 2025, the project targets 36 distinct languages, with 42 internal variants, e.g. different scripts or written standards.
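For scripting against such labels, a minimal sketch of splitting a language–script label into its two parts is shown below; it only checks the shape of the label, not membership in the actual ISO 639-3 or ISO 15924 code tables.

```python
import re

# Shape of the language–script labels: a lowercase ISO 639-3 code, an
# underscore, and a title-case ISO 15924 script code, e.g. "ces_Latn".
LABEL = re.compile(r"(?P<lang>[a-z]{3})_(?P<script>[A-Z][a-z]{3})")


def split_label(label: str) -> tuple[str, str]:
    """Split a label such as 'srp_Cyrl' into (language, script)."""
    match = LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not an <iso639-3>_<iso15924> label: {label!r}")
    return match.group("lang"), match.group("script")


print(split_label("ces_Latn"))  # ('ces', 'Latn')
print(split_label("srp_Cyrl"))  # ('srp', 'Cyrl')
```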
To obtain reasonably comparable statistics across different resources, the following metrics are defined:
- bytes: on-disk size in native format, e.g. compressed JSON or Parquet
- documents: number of documents, e.g. web pages, papers, books, or similar
- segments: number of paragraph-like units (e.g. `<h1>`, `<p>`, `<li>`, `<pre>` in HTML)
- characters: total volume in Unicode characters (including whitespace)
- tokens: sub-word units according to a common tokenizer (currently Gemma 3)
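The sketch below illustrates how such per-document statistics could be approximated, assuming the document text sits in a `text` field and that a Gemma 3 tokenizer is loaded through Hugging Face transformers (the exact model identifier is an assumption). The bytes metric is simply the on-disk size of the compressed file and is not computed per document.

```python
import os

from transformers import AutoTokenizer  # pip install transformers

# Assumed model identifier; only its tokenizer is needed for the token count.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")


def document_stats(doc: dict) -> dict:
    """Approximate per-document counts for the catalogue metrics."""
    text = doc.get("text", "")  # field name is an assumption
    return {
        "documents": 1,
        # Crude stand-in for paragraph-like units: blank-line-separated blocks.
        "segments": sum(1 for block in text.split("\n\n") if block.strip()),
        "characters": len(text),
        "tokens": len(tokenizer.encode(text, add_special_tokens=False)),
    }


def file_bytes(path: str) -> int:
    """On-disk size of the (compressed) file, matching the bytes metric."""
    return os.path.getsize(path)
```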
Common pre-training datasets typically comprise large components of text derived from web data, e.g. from the Common Crawl, the Internet Archive, or other initiatives, and sometimes also include non-web data, e.g. (out-of-copyright) books, government publications, or scientific literature. For various sources, text can be derived from different publishing formats, e.g. HTML documents or PDF files extracted from web crawls. It would be hard to devise a formal ontology that fully describes different data sources. Instead, catalogue entries provide a free-text, high-level indication of salient information, e.g. the origin of underlying web crawls and other data sources, original document formats, and the like.