Pretraining Data Sources

C4
An English, cleaned version of Common Crawl’s web crawl corpus (https://commoncrawl.org).
Text
Common Voice
28k hours [as of 11/2023] of crowd-sourced read speech from 100+ languages
- Website
Speech
CulturaX
A pretraining dataset of 6.3T tokens covering 167 languages, cleaned, deduplicated, and refined. Combines mC4 (data up to 2020) with OSCAR project data up to 2023.
Text
DataComp-1B and CommonPool-13B
A large pool of 13B image-text pairs from CommonCrawl and a curated 1B subset
Text Vision
Dolma
A pretraining dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Text
GigaSpeech
40k hours (10k transcribed) multi-domain English speech corpus
Speech
Golos
1,240 hours of crowd-sourced Russian speech
Speech
IndicCorp v2
A multilingual pre-training corpus for 24 Indian languages
Text
IndicSUPERB
1,684 hour crowd-sourced corpus of 12 Indian languages
Speech
Libri-Light
60k hour read English speech from LibriVox audiobooks
Speech
LibriSpeech
960 hour read English speech from LibriVox audiobooks
Speech
MADLAD-400
A manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages.
Text
mC4
The fully multilingual, cleaned version of Common Crawl’s web crawl corpus (https://commoncrawl.org).
Text
MMC4
Interleaved image-text data from Common Crawl (570M images, 43B tokens)
Text Vision
OBELICS
Interleaved image-text data from Common Crawl (353M images, 115B tokens)
Text Vision
OLC
The Open License Corpus is a 228B token corpus of permissively-licensed, primarily English text data for pretraining.
Text
OpenWebMath
A dataset containing the majority of the high-quality, mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens.
Text
OPUS
The Open Parallel Corpus is a massive collection of translated text pairs from the web.
- Website
Text
OSCAR
The Open Super-large Crawled Aggregated coRpus provides web-based multilingual datasets across 166 languages.
Text
peS2o
A collection of ~40M open-access academic papers, cleaned, filtered, and formatted for pre-training of language models, originally derived from the Semantic Scholar Open Research Corpus (S2ORC).
Text
Pile of Law
An open-source, English dataset with ∼256GB of legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records.
Text
RedPajama v2
A pretraining dataset of 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
Text
ROOTS
A massive multilingual pretraining corpus from BigScience, comprising 1.6TB of text spanning 59 languages. It is a mix of OSCAR (https://oscar-project.org/) and the datasets found in the BigScience Catalogue (https://huggingface.co/spaces/bigscience/SourcingCatalog).
Text
Samrómur
2,200 hour crowd-sourced corpus of Icelandic speech
- Website
Speech
Shrutilipi
6,400 hour corpus of TV/Radio broadcasts from 12 Indian languages
Speech
The People’s Speech
30k hour conversational English dataset
Speech
The Pile
An 825GB English pretraining corpus built from 22 smaller, high-quality datasets, including a filtered portion of Common Crawl.
Text
The Proof Pile 2
The Proof-Pile-2 is a 55 billion token dataset of mathematical and scientific documents.
- Download Paper
- Hugging Face
- GitHub
- Website
Text
The RefinedWeb
An English-only, web-only, deduplicated pretraining dataset of five trillion tokens.
Text
The Stack
The Stack is a 6TB, permissively-licensed pretraining dataset from active GitHub repositories covering 358 programming languages.
Text
VoxPopuli
400k hours of unlabelled speech in 23 languages from the European Parliament
Speech
WebVid-10M
10M videos with captions
Text Vision
WenetSpeech
22.4k hour multi-domain corpus of Mandarin
Speech
WURA
A manually audited multilingual pre-training corpus (document-level dataset) for 16 African languages and four high-resource languages widely spoken in Africa (English, French, Arabic and Portuguese)
Text
WebDatasets
A dataset format for high-performance streaming of data. Especially useful for modalities other than language that are more I/O intensive during training, such as images, video, or audio.
- GitHub
Text Speech Vision
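
The streaming idea behind this format can be sketched with the Python standard library alone. The sketch assumes the usual WebDataset convention that a shard is a plain tar archive in which consecutive files sharing a basename (for example 000000.txt and 000000.cls) make up one sample. It does not use the WebDataset library itself; the helper names build_shard and iter_samples and the toy sample contents are hypothetical.

```python
import io
import tarfile


def build_shard(samples):
    """Write samples to an in-memory tar shard.

    `samples` maps a sample key to {extension: bytes}. Files that share a
    key are stored next to each other, which is what lets a reader
    reassemble samples in a single sequential pass.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, parts in samples.items():
            for ext, payload in parts.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Read a shard strictly front to back, yielding one dict per sample."""
    current_key, sample = None, {}
    # Mode "r|" is tarfile's non-seeking stream mode (sequential reads only).
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: "000001.txt" -> ("000001", "txt").
            key, _, ext = member.name.partition(".")
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


# Toy data, made up for illustration.
shard = build_shard({
    "000000": {"txt": b"a cat", "cls": b"0"},
    "000001": {"txt": b"a dog", "cls": b"1"},
})
decoded = list(iter_samples(shard))
```

Because the reader never seeks, the same loop could in principle consume a shard from a pipe or an HTTP response body, which is where the I/O benefit for large image, video, or audio files comes from.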
Multi Legal Pile
A large-scale multilingual legal dataset and superset of the Pile of Law, suited for pretraining language models. It spans over 24 languages and five legal text types.
Text
French-PD-Newspapers
Nearly three million unique newspaper and periodical editions (70B words) from the French National Library.
- Hugging Face
Text
OpenWebText
An open-source replication of OpenAI's WebText dataset, which was used to train GPT-2.
Text
Croissant
Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.
Text Vision Speech Video Tabular
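
To make the "dataset structure, resources, and semantics" part concrete, here is a rough sketch of what a Croissant-style JSON-LD description might look like, built as a Python dictionary. The dataset, file name, URL, and columns are all hypothetical. The key names (distribution, recordSet, field, cr:FileObject, and so on) follow one reading of the Croissant 1.0 format and should be checked against the official specification. The @context is abbreviated, so this is not a conformant file.

```python
import json

# Hypothetical dataset description; every name, URL, and column is made up.
croissant = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "A made-up review corpus used only for illustration.",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Resources: the physical files that hold the data.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure and semantics: how records and typed fields map onto files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/stars",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "stars"},
                    },
                },
            ],
        }
    ],
}


def describe_fields(metadata):
    """List (record set id, field id, data type) for every declared field."""
    return [
        (rs["@id"], f["@id"], f["dataType"])
        for rs in metadata.get("recordSet", [])
        for f in rs.get("field", [])
    ]


# Round-trip through JSON to confirm the description is serialisable.
round_tripped = json.loads(json.dumps(croissant))
fields = describe_fields(round_tripped)
```

A tool that understands the format can use a description like this to locate the files and expose typed records without dataset-specific loading code, which is the interoperability benefit described above.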

39 Pretraining Data Sources