Practitioners should consider using already-optimized codebases, especially for pretraining, to make effective use of computational resources, capital, power, and effort. Existing open-source codebases targeted at foundation model pretraining can be significantly more accessible to new practitioners and encourage efficient training strategies.
Pretraining Data Sources
CulturaX
A pretraining dataset of 16T tokens covering 167 languages, cleaned, deduplicated, and refined. It combines mC4 (data up to 2020) with OSCAR project data up to 2023.
Modalities: Text
DataComp-1B and CommonPool-13B
A large pool of 13B image-text pairs from CommonCrawl and a curated 1B subset.
Modalities: Text, Vision
Dolma
A pretraining dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Modalities: Text
MADLAD-400
A manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages.
Modalities: Text
mC4
The fully multilingual, cleaned version of Common Crawl's web crawl corpus (https://commoncrawl.org).
Modalities: Text
OLC
The Open License Corpus is a 228B token corpus of permissively-licensed, primarily English text data for pretraining.
Modalities: Text
OpenWebMath
A dataset containing the majority of the high-quality mathematical text from the internet, filtered and extracted from over 200B HTML files on Common Crawl down to 6.3 million documents containing a total of 14.7B tokens.
Modalities: Text
OSCAR
The Open Super-large Crawled Aggregated coRpus provides web-based multilingual datasets across 166 languages.
Modalities: Text
peS2o
A collection of ~40M creative open-access academic papers, cleaned, filtered, and formatted for pre-training of language models, originally derived from the Semantic Scholar Open Research Corpus (S2ORC).
Modalities: Text
Pile of Law
An open-source, English dataset with ~256GB of legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records.
Modalities: Text
RedPajama v2
A pretraining dataset of 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
Modalities: Text
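For illustration, the sketch below streams RedPajama v2 with the Hugging Face datasets library and keeps only documents whose word-count quality signal clears a threshold. The Hub id, config name, field names, signal name, and threshold are assumptions recalled from the dataset card, not guaranteed specifics; verify them against the card before use.

```python
import json
from datasets import load_dataset  # pip install datasets

# Assumed Hub id / config / field names -- check the RedPajama v2 dataset card.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",     # small sample config; the full corpus uses other configs
    split="train",
    streaming=True,    # iterate without downloading the whole dump
)

def passes_quality(example, min_words=50):
    """Keep documents whose word-count quality signal clears a chosen threshold."""
    signals = json.loads(example["quality_signals"])
    # Document-level signals are assumed to be stored as [start, end, value] spans.
    spans = signals.get("rps_doc_word_count") or []
    word_count = spans[0][2] if spans else 0
    return word_count >= min_words

filtered = (ex for ex in ds if passes_quality(ex))
print(next(filtered)["raw_content"][:200])
```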
ROOTS
A massive multilingual pretraining corpus from BigScience, comprising 1.6TB of text spanning 59 languages. It is a mix of OSCAR (https://oscar-project.org/) and the datasets found in the BigScience Catalogue (https://huggingface.co/spaces/bigscience/SourcingCatalog).
Modalities: Text
The Pile
An 825GB English pretraining corpus that mixes portions of Common Crawl with 22 smaller, high-quality datasets.
Modalities: Text
The Proof Pile 2
The Proof-Pile-2 is a 55 billion token dataset of mathematical and scientific documents.
Modalities: Text
The RefinedWeb
An English-only, web-only, deduplicated pretraining dataset of five trillion tokens.
Modalities: Text
The Stack
The Stack is a 6TB, permissively-licensed pretraining dataset from active GitHub repositories covering 358 programming languages.
Modalities: Text
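As a usage sketch, per-language subsets of The Stack can be streamed from the Hugging Face Hub without downloading the full corpus. The repository is gated, so this assumes you have accepted its terms and logged in; the repository id, data_dir layout, and content field are recalled from the dataset card and should be double-checked.

```python
from datasets import load_dataset  # pip install datasets
# Requires prior `huggingface-cli login` and accepting the dataset's terms of use.

# Stream only the Python subset; data_dir is assumed to follow the repo's
# data/<language> layout.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

for sample in ds:
    print(sample["content"][:200])  # source file contents
    break
```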
WURA
A manually audited multilingual pretraining corpus (document-level dataset) for 16 African languages and four high-resource languages widely spoken in Africa (English, French, Arabic, and Portuguese).
Modalities: Text
WebDatasets
A dataset format for high-performance streaming of data. Especially useful for modalities other than language that are more I/O-intensive during training, such as images, video, or audio.
Modalities: Text, Speech, Vision
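A minimal sketch of the WebDataset pattern using the webdataset Python package: samples live in plain .tar shards, grouped by shared basename, and are streamed and decoded on the fly. The shard URLs and member extensions below are hypothetical.

```python
import webdataset as wds  # pip install webdataset

# Hypothetical shard URLs; each shard is an ordinary .tar archive in which files
# sharing a basename (e.g. 000017.jpg + 000017.json) make up one sample.
urls = "https://example.com/shards/train-{000000..000099}.tar"

dataset = (
    wds.WebDataset(urls)
    .shuffle(1000)             # approximate shuffle via a rolling buffer
    .decode("pil")             # decode image entries to PIL images
    .to_tuple("jpg", "json")   # select sample fields by file extension
)

for image, metadata in dataset:
    # hand off to the training loop; shards stream sequentially, keeping I/O fast
    break
```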
Multi Legal Pile
A large-scale multilingual legal dataset and superset of the Pile of Law, suited for pretraining language models. It spans 24 languages and five legal text types.
Modalities: Text
French-PD-Newspapers
Nearly three million unique newspaper and periodical editions (70B words) from the French National Library.
Modalities: Text