8 Data Deduplication Resources for Foundation Models

Removing duplicated data can 1) reduce the likelihood of memorizing undesirable pieces of information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.

Data Deduplication

  • Apricot

    apricot implements submodular optimization for summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful both for visualizing the modalities in the data and for training accurate machine learning models with just a fraction of the examples and compute.

    Modalities: Text, Speech, Vision
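
    As a rough illustration, the sketch below selects a representative subset with apricot's facility-location selector; the synthetic feature matrix, subset size, and metric are arbitrary choices for the example, not recommendations from the project.

    ```python
    # Illustrative: pick a minimally redundant, representative subset.
    # pip install apricot-select  (provides the `apricot` module)
    import numpy as np
    from apricot import FacilityLocationSelection

    # Toy feature matrix standing in for embeddings of a larger dataset.
    X = np.random.RandomState(0).randn(2_000, 128)

    # Facility location is a submodular function; greedy optimization picks
    # a subset whose members are close to (representative of) everything else.
    selector = FacilityLocationSelection(n_samples=100, metric='euclidean')
    X_subset = selector.fit_transform(X)

    print(X_subset.shape)         # (100, 128): the selected subset
    print(selector.ranking[:10])  # indices of the first ten selections
    ```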
  • Datacomp image dedup

    Code to deduplicate vision datasets for the DataComp challenge.

  • Dolma Dedupe Tool

    Dolma’s text deduplication tool for pretraining data.

  • Google Text Deduplication

    A repository to deduplicate language model datasets. They release the ExactSubstr deduplication implementation (written in Rust) along with Python scripts to run it and inspect the results. They also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.
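
    To make the ExactSubstr idea concrete, here is a naive Python sketch of suffix-array-based duplicate-span detection; it illustrates the algorithm on toy inputs and is not the released Rust implementation (which also merges overlapping spans and operates at vastly larger scale).

    ```python
    # Toy ExactSubstr sketch: find byte spans occurring more than once by
    # sorting suffixes and measuring overlaps between adjacent suffixes.

    MIN_LEN = 20  # stand-in for the paper's much longer token threshold

    def duplicated_spans(corpus: bytes, min_len: int = MIN_LEN):
        # Naive O(n^2 log n) suffix array: sort all suffix start positions.
        sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
        spans = []
        for a, b in zip(sa, sa[1:]):
            # Longest common prefix of two adjacent sorted suffixes.
            k = 0
            while (a + k < len(corpus) and b + k < len(corpus)
                   and corpus[a + k] == corpus[b + k]):
                k += 1
            if k >= min_len:
                spans.append((a, a + k))  # occurs at least twice; may overlap
        return spans

    doc = b"the quick brown fox jumps over the lazy dog. " * 2 + b"unique tail"
    for start, end in duplicated_spans(doc):
        print(start, end, doc[start:end])
    ```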

  • RedPajama-Data

    Tools for exact deduplication with a Bloom filter, fuzzy deduplication with locality-sensitive hashing (LSH), and calculating quality scores.
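
    The Bloom-filter side of exact deduplication fits in a few lines; the sketch below uses illustrative sizes and a SHA-256-based hash scheme rather than RedPajama-Data's actual parameters.

    ```python
    # Exact dedup with a Bloom filter: hash each normalized document and
    # drop it if the filter has (probably) seen that hash before.
    import hashlib

    class BloomFilter:
        def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 7):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, item: bytes):
            for seed in range(self.n_hashes):
                h = hashlib.sha256(seed.to_bytes(4, "little") + item).digest()
                yield int.from_bytes(h[:8], "little") % self.n_bits

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p // 8] >> (p % 8) & 1
                       for p in self._positions(item))

    bf = BloomFilter()
    unique_docs = []
    for doc in ["a doc", "another doc", "a doc"]:     # toy corpus
        key = " ".join(doc.lower().split()).encode()  # light normalization
        if key not in bf:                             # probably unseen
            bf.add(key)
            unique_docs.append(doc)
    print(unique_docs)  # ['a doc', 'another doc']
    ```

    Bloom filters trade a small false-positive rate (a few unique documents wrongly dropped) for constant memory, which is what makes exact deduplication feasible at corpus scale.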

  • Pile

    A set of tools for deduplication with MinHashLSH.

  • SlimPajama Webpage

    We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion-token datasets in a distributed, multi-threaded, and memory-efficient fashion.
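
    Both the Pile and SlimPajama pipelines rely on MinHashLSH for fuzzy deduplication. The sketch below shows the underlying idea using the datasketch library, which is an assumption for illustration (both projects ship their own implementations); the shingle size and threshold are arbitrary.

    ```python
    # Fuzzy dedup with MinHash LSH: index each document's MinHash signature
    # and skip documents whose signature collides with one already indexed.
    from datasketch import MinHash, MinHashLSH

    def minhash(text: str, num_perm: int = 128) -> MinHash:
        m = MinHash(num_perm=num_perm)
        tokens = text.lower().split()
        # Hash word 3-grams (shingles), a common choice in dedup pipelines.
        for i in range(max(len(tokens) - 2, 1)):
            m.update(" ".join(tokens[i:i + 3]).encode())
        return m

    docs = {
        "d1": "the cat sat on the mat and looked around the room",
        "d2": "the cat sat on the mat and looked around the house",
        "d3": "completely unrelated text about suffix arrays and hashing",
    }

    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    kept = []
    for key, text in docs.items():
        m = minhash(text)
        if not lsh.query(m):  # no near-duplicate indexed yet
            lsh.insert(key, m)
            kept.append(key)
    print(kept)  # likely ['d1', 'd3']: d2 is a near-duplicate of d1
    ```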

  • Croissant Website

    Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.

    Modalities: Text, Vision, Speech, Video, Tabular
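
    As a hedged sketch of how a Croissant description is consumed, the snippet below loads a dataset through the mlcroissant reference library; the URL and record-set name are assumptions for illustration (Hugging Face serves Croissant JSON-LD for many hosted datasets at URLs of this shape).

    ```python
    # Load a Croissant-described dataset with the reference library.
    # pip install mlcroissant
    import mlcroissant as mlc

    # Assumed URL: Hugging Face's Croissant endpoint for the MNIST dataset.
    ds = mlc.Dataset(jsonld="https://huggingface.co/api/datasets/mnist/croissant")

    # Each record is a dict mapping field names to values; the record-set
    # name ("mnist") is an assumption and varies per dataset.
    for i, record in enumerate(ds.records(record_set="mnist")):
        print(record)
        if i >= 2:
            break
    ```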