Removing duplicate data can 1) reduce the likelihood of memorizing undesirable information such as boilerplate text, copyrighted data, and personally identifiable information, and 2) improve training efficiency by reducing the total dataset size. Practitioners should always determine whether duplicated data will harm or help the model for their use case.
6 Data Deduplication Resources for Foundation Models
Data Deduplication
Apricot
apricot implements submodular optimization to summarize massive data sets into minimally redundant subsets that remain representative of the original data. These subsets are useful both for visualizing the modalities in the data and for training accurate machine learning models with just a fraction of the examples and compute.
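To make the idea of a minimally redundant, representative subset concrete, here is a minimal sketch of greedy selection under a facility location objective, one common submodular function. It is written in plain Python and does not use apricot's API; the function name `greedy_facility_location` and the similarity `1 / (1 + distance)` are assumptions chosen for illustration.

```python
import math


def greedy_facility_location(points, k):
    """Greedily pick up to k indices that maximize a facility location objective.

    Hypothetical helper for illustration only; not part of apricot.
    """
    n = len(points)
    # Pairwise similarity in (0, 1]; identical points have similarity 1.
    sim = [[1.0 / (1.0 + math.dist(p, q)) for q in points] for p in points]
    selected = []
    # coverage[i] = similarity of point i to its most similar selected point.
    coverage = [0.0] * n
    for _ in range(min(k, n)):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves total coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters of near-duplicate points.
points = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
selected = greedy_facility_location(points, k=2)
print(selected)
```

Because a second point from an already covered cluster adds almost no gain, the two selected indices come from different clusters, which is the redundancy-avoiding behavior described above.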
Text Speech Vision
Google Text Deduplication
A repository for deduplicating language model datasets. The authors release the ExactSubstr deduplication implementation (written in Rust) along with scripts to perform ExactSubstr deduplication and inspect the results (written in Python). They also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-40B-en.
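Exact substring deduplication can be illustrated with a suffix array: after sorting all suffixes of a text, any substring that occurs more than once shows up as a long common prefix between neighboring suffixes. The sketch below is a deliberately naive, character-level Python version for small inputs, not the repository's Rust implementation; `duplicated_spans` and `min_len` are hypothetical names.

```python
def duplicated_spans(text, min_len):
    """Return merged (start, end) spans covering substrings of length >= min_len
    that occur more than once in text. Naive sketch; not suitable for large data."""
    # Naive suffix array: start indices of all suffixes in lexicographic order.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    spans = []
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Length of the longest common prefix of two neighboring suffixes.
        lcp = 0
        while max(a, b) + lcp < len(text) and text[a + lcp] == text[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            spans += [(a, a + lcp), (b, b + lcp)]
    # Merge overlapping spans into maximal intervals.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = (
    "Subscribe to our newsletter today. Cats purr. "
    "Subscribe to our newsletter today. Dogs bark."
)
spans = duplicated_spans(text, min_len=20)
for start, end in spans:
    print(start, end, repr(text[start:end]))
```

A deduplication pass would then keep one occurrence of each repeated span and drop the rest. Production tools work on tokenized corpora and build the suffix array far more efficiently than the sort used here.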
Text
RedPajama-Data
Tools for exact deduplication with a Bloom filter, fuzzy deduplication with locality-sensitive hashing (LSH), and calculating quality scores.
Text
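The two deduplication techniques named for RedPajama-Data can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not RedPajama-Data's code: the class `BloomFilter`, the functions `exact_dedup`, `minhash`, and `lsh_candidate_pairs`, and all parameter values (filter size, number of hashes, 3-word shingles, 8 bands) are hypothetical choices.

```python
import hashlib


def _hash(seed, item):
    """64-bit hash of item, salted with seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        return [_hash(seed, item) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def exact_dedup(docs):
    """Keep the first occurrence of each document, using a Bloom filter as the 'seen' set."""
    seen, kept = BloomFilter(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            kept.append(doc)
    return kept


def minhash(text, num_perm=32, n=3):
    """MinHash signature over lowercase word n-gram shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_perm)]


def lsh_candidate_pairs(docs, num_perm=32, bands=8):
    """Return index pairs whose signatures agree on at least one full band."""
    rows = num_perm // bands
    buckets = {}
    for idx, doc in enumerate(docs):
        sig = minhash(doc, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    return {
        (ids[i], ids[j])
        for ids in buckets.values()
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
    }


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the LAZY dog",
    "completely different sentence about data pipelines",
]
print(exact_dedup(["a", "b", "a"]))
print(lsh_candidate_pairs(docs))
```

The Bloom filter trades a small false-positive rate (a new document occasionally treated as already seen) for constant memory. With LSH, documents that share more shingles are more likely to land in the same bucket, so candidate pairs are usually confirmed afterwards with an explicit similarity check.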