8 Data Deduplication Resources for Foundation Models

Removing duplicated data can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

Data Deduplication

  • Apricot

    apricot implements submodular optimization for summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful both for visualizing the modalities in the data and for training accurate machine learning models with just a fraction of the examples and compute.

    Modalities: Text, Speech, Vision
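
    As a rough illustration, the sketch below selects a representative subset with apricot's facility-location selector; the synthetic feature matrix, subset size, and metric are arbitrary choices for the example, not recommendations from the project.

    ```python
    # Illustrative: pick a minimally redundant, representative subset.
    # pip install apricot-select  (provides the `apricot` module)
    import numpy as np
    from apricot import FacilityLocationSelection

    # Toy feature matrix standing in for embeddings of a larger dataset.
    X = np.random.RandomState(0).randn(2_000, 128)

    # Facility location is a submodular function; greedy optimization picks
    # a subset whose members are close to (representative of) everything else.
    selector = FacilityLocationSelection(n_samples=100, metric='euclidean')
    X_subset = selector.fit_transform(X)

    print(X_subset.shape)         # (100, 128): the selected subset
    print(selector.ranking[:10])  # indices of the first ten selections
    ```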
  • Datacomp image dedup

    Code to deduplicate vision datasets for the DataComp challenge.

  • Dolma Dedupe Tool

    Dolma’s text deduplication tool for pretraining data.

  • Google Text Deduplication

    A repository to deduplicate language model datasets. They release the ExactSubstr deduplication implementation (written in Rust) along with Python scripts to run it and inspect the results. They also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.
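
    To make the ExactSubstr idea concrete, here is a naive Python sketch of suffix-array-based duplicate-span detection; it illustrates the algorithm on toy inputs and is not the released Rust implementation (which also merges overlapping spans and operates at vastly larger scale).

    ```python
    # Toy ExactSubstr sketch: find byte spans occurring more than once by
    # sorting suffixes and measuring overlaps between adjacent suffixes.

    MIN_LEN = 20  # stand-in for the paper's much longer token threshold

    def duplicated_spans(corpus: bytes, min_len: int = MIN_LEN):
        # Naive O(n^2 log n) suffix array: sort all suffix start positions.
        sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
        spans = []
        for a, b in zip(sa, sa[1:]):
            # Longest common prefix of two adjacent sorted suffixes.
            k = 0
            while (a + k < len(corpus) and b + k < len(corpus)
                   and corpus[a + k] == corpus[b + k]):
                k += 1
            if k >= min_len:
                spans.append((a, a + k))  # occurs at least twice; may overlap
        return spans

    doc = b"the quick brown fox jumps over the lazy dog. " * 2 + b"unique tail"
    for start, end in duplicated_spans(doc):
        print(start, end, doc[start:end])
    ```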

  • RedPajama-Data

    Tools for exact deduplication with a Bloom filter, fuzzy deduplication with locality-sensitive hashing (LSH), and calculating quality scores.
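
    The Bloom-filter side of exact deduplication fits in a few lines; the sketch below uses illustrative sizes and a SHA-256-based hash scheme rather than RedPajama-Data's actual parameters.

    ```python
    # Exact dedup with a Bloom filter: hash each normalized document and
    # drop it if the filter has (probably) seen that hash before.
    import hashlib

    class BloomFilter:
        def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 7):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, item: bytes):
            for seed in range(self.n_hashes):
                h = hashlib.sha256(seed.to_bytes(4, "little") + item).digest()
                yield int.from_bytes(h[:8], "little") % self.n_bits

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p // 8] >> (p % 8) & 1
                       for p in self._positions(item))

    bf = BloomFilter()
    unique_docs = []
    for doc in ["a doc", "another doc", "a doc"]:     # toy corpus
        key = " ".join(doc.lower().split()).encode()  # light normalization
        if key not in bf:                             # probably unseen
            bf.add(key)
            unique_docs.append(doc)
    print(unique_docs)  # ['a doc', 'another doc']
    ```

    Bloom filters trade a small false-positive rate (a few unique documents wrongly dropped) for constant memory, which is what makes exact deduplication feasible at corpus scale.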

  • Pile

    A set of tools for deduplication with MinHashLSH.

  • SlimPajama Webpage

    We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion-token datasets in a distributed, multi-threaded, and memory-efficient fashion.
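
    Both the Pile and SlimPajama pipelines rely on MinHashLSH for fuzzy deduplication. The sketch below shows the underlying idea using the datasketch library, which is an assumption for illustration (both projects ship their own implementations); the shingle size and threshold are arbitrary.

    ```python
    # Fuzzy dedup with MinHash LSH: index each document's MinHash signature
    # and skip documents whose signature collides with one already indexed.
    from datasketch import MinHash, MinHashLSH

    def minhash(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        tokens = text.lower().split()
        # Hash word 3-grams (shingles), a common choice in dedup pipelines.
        for i in range(max(len(tokens) - 2, 1)):
            m.update(" ".join(tokens[i:i + 3]).encode())
        return m

    docs = {
        "d1": "the cat sat on the mat and looked around the room",
        "d2": "the cat sat on the mat and looked around the house",
        "d3": "completely unrelated text about suffix arrays and hashing",
    }

    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    kept = []
    for key, text in docs.items():
        m = minhash(text)
        if not lsh.query(m):  # no near-duplicate indexed yet
            lsh.insert(key, m)
            kept.append(key)
    print(kept)  # likely ['d1', 'd3']: d2 is a near-duplicate of d1
    ```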

  • Croissant Website

    Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.

    Modalities: Text, Vision, Speech, Video, Tabular
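
    As a hedged sketch of how a Croissant description is consumed, the snippet below loads a dataset through the mlcroissant reference library; the URL and record-set name are assumptions for illustration (Hugging Face serves Croissant JSON-LD for many hosted datasets at URLs of this shape).

    ```python
    # Load a Croissant-described dataset with the reference library.
    # pip install mlcroissant
    import mlcroissant as mlc

    # Assumed URL: Hugging Face's Croissant endpoint for the MNIST dataset.
    ds = mlc.Dataset(jsonld="https://huggingface.co/api/datasets/mnist/croissant")

    # Each record is a dict mapping field names to values; the record-set
    # name ("mnist") is an assumption and varies per dataset.
    for i, record in enumerate(ds.records(record_set="mnist")):
        print(record)
        if i >= 2:
            break
    ```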