Data quality is crucial. Filtering can remove unwanted data, improving training efficiency and ensuring desirable properties such as high information content, coverage of desired languages, low toxicity, and minimal personally identifiable information. Consider the trade-offs of each filter, and understand the importance of data mixtures.
17 Data Cleaning, Filtering, & Mixing Resources for Foundation Models
Data Cleaning
Data Selection via Importance Resampling (DSIR)
A tool for selecting data with a similar distribution to a target dataset
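As a rough sketch of how DSIR is used: the `data-selection` package released with the paper fits hashed n-gram importance estimators on raw and target data, then resamples the raw data by importance weight. The class and method names below follow our reading of the package README and may differ across versions; the file paths are hypothetical.

```python
# Sketch: select raw text whose hashed n-gram features resemble a target set.
# Assumes the `data-selection` package (pip install data-selection) from the
# DSIR authors; API names follow its README and may change between versions.
from data_selection import HashedNgramDSIR

raw_files = ["raw/web_shard0.jsonl", "raw/web_shard1.jsonl"]  # hypothetical paths
target_files = ["target/wikipedia_sample.jsonl"]              # hypothetical path

dsir = HashedNgramDSIR(raw_files, target_files, cache_dir=".dsir_cache")
dsir.fit_importance_estimator(num_tokens_to_fit="auto")  # fit n-gram models
dsir.compute_importance_weights()                        # per-document weights
dsir.resample(out_dir="selected", num_to_sample=100_000, cache_compressed=True)
```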
Detoxify
A Python library designed to identify toxic language in comments. Works in seven languages: English, Italian, French, Russian, Portuguese, Spanish, and Turkish.
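Detoxify exposes pretrained checkpoints through a small predict API. The sketch below scores a batch of comments with the multilingual checkpoint and drops those above a toxicity cutoff; the sample comments and the 0.5 threshold are illustrative choices, not from the source.

```python
# Score comments for toxicity and keep only the ones below a chosen cutoff.
# Assumes the `detoxify` package (pip install detoxify); the "multilingual"
# checkpoint covers the seven languages listed above.
from detoxify import Detoxify

comments = ["you are wonderful", "you are an idiot"]
scores = Detoxify("multilingual").predict(comments)  # dict: label -> list of scores

kept = [c for c, tox in zip(comments, scores["toxicity"]) if tox < 0.5]
print(kept)  # only the non-toxic comment survives
```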
Dolma's Toolkit
A Python framework for defining Taggers that identify non-language text, perform language ID, and detect PII, toxic text, and “quality” text. Includes reimplementations of heuristics used by Gopher and C4 to remove non-natural language.
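A hedged sketch of defining a custom Tagger is shown below. The registration decorator and module paths follow the Dolma developer docs as we recall them and may differ between versions; the tagger name `char_length_v1` is hypothetical.

```python
# Sketch of a custom Dolma tagger that scores each document by its length.
# Module paths and the registration API are assumptions based on the Dolma
# developer docs and may not match every release.
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger

@TaggerRegistry.add("char_length_v1")  # hypothetical tagger name
class CharLengthTagger(BaseTagger):
    def predict(self, doc: Document) -> DocResult:
        # Attach one span covering the whole document, scored by its length.
        span = Span(start=0, end=len(doc.text), type="length", score=float(len(doc.text)))
        return DocResult(doc=doc, spans=[span])
```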
FUN-LangID
Frequently Used N-grams Language ID model, a character 4-gram model trained to recognize up to 1633 languages.
Langdetect
A tool to predict the language of a text, used to filter data in or out based on the desired languages.
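For example, a corpus can be filtered down to a set of desired languages in a few lines; the document list, language set, and helper function below are illustrative.

```python
# Keep only documents detected as one of the desired languages.
# Assumes the `langdetect` package (pip install langdetect).
from langdetect import DetectorFactory, LangDetectException, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; seed for reproducibility

def detected_lang(text):
    try:
        return detect(text)
    except LangDetectException:  # raised when no language features are found
        return None

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt.",
    "1234567890",
]
wanted = {"en", "de"}
kept = [d for d in docs if detected_lang(d) in wanted]
print(kept)
```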
Lilac
A Python package for better understanding your data. Includes keyword and semantic search, as well as detection of PII, duplicates, and language.
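A rough usage sketch follows, assuming the dataset-plus-signals API from the Lilac quickstart; names such as `DatasetConfig`, `PIISignal`, and `LangDetectionSignal` reflect our reading of its docs and may differ across versions, and the dataset choice is illustrative.

```python
# Sketch: load a dataset into a Lilac project and enrich it with PII and
# language-detection signals. API names follow the Lilac quickstart as we
# recall it (pip install lilac) and may vary between releases.
import lilac as ll

ll.set_project_dir("./lilac_project")

config = ll.DatasetConfig(
    namespace="local",
    name="my_corpus",  # hypothetical dataset name
    source=ll.HuggingFaceSource(dataset_name="imdb"),
)
dataset = ll.create_dataset(config)

dataset.compute_signal(ll.PIISignal(), "text")            # flag emails, IPs, etc.
dataset.compute_signal(ll.LangDetectionSignal(), "text")  # per-document language
```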
Roots data cleaning pipeline
A pipeline for processing and improving the quality of crowdsourced datasets.
SpeechBrain's spoken language ID model
A pre-trained spoken language identification model trained on VoxLingua107, a dataset of audio sourced from YouTube covering 107 languages.
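Following the model card on the Hugging Face Hub, the classifier can be loaded and applied roughly as below; the local file name is hypothetical, and newer SpeechBrain releases move `EncoderClassifier` under `speechbrain.inference`.

```python
# Identify the spoken language of an audio clip with SpeechBrain's pretrained
# VoxLingua107 classifier, per its Hugging Face model card.
from speechbrain.pretrained import EncoderClassifier

language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="pretrained_models/lang-id-voxlingua107-ecapa",
)
signal = language_id.load_audio("clip.wav")  # hypothetical local audio file
prediction = language_id.classify_batch(signal)
print(prediction[3])  # predicted language label
```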
The Pile processing scripts
A series of scripts to replicate the Pile dataset. Includes language filtering, profanity filtering, deduplication, and test-set decontamination.