Data quality is crucial. Filtering can remove unwanted data, improving training efficiency and ensuring desirable properties such as high information content, coverage of desired languages, low toxicity, and minimal personally identifiable information. Consider the trade-offs of each filter, and understand the importance of data mixtures.
17 Data Cleaning, Filtering, & Mixing Resources for Foundation Models
Data Cleaning
Data Selection via Importance Resampling (DSIR)
A tool for selecting data with a similar distribution to a target dataset
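As a rough sketch of how DSIR is used: the `data-selection` package released with the paper fits hashed n-gram importance estimators on raw and target data, then resamples the raw data by importance weight. The class and method names below follow our reading of the package README and may differ across versions; the file paths are hypothetical.

```python
# Sketch: select raw text whose hashed n-gram features resemble a target set.
# Assumes the `data-selection` package (pip install data-selection) from the
# DSIR authors; API names follow its README and may change between versions.
from data_selection import HashedNgramDSIR

raw_files = ["raw/web_shard0.jsonl", "raw/web_shard1.jsonl"]  # hypothetical paths
target_files = ["target/wikipedia_sample.jsonl"]              # hypothetical path

dsir = HashedNgramDSIR(raw_files, target_files, cache_dir=".dsir_cache")
dsir.fit_importance_estimator(num_tokens_to_fit="auto")  # fit n-gram models
dsir.compute_importance_weights()                        # per-document weights
dsir.resample(out_dir="selected", num_to_sample=100_000, cache_compressed=True)
```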
Detoxify
A Python library designed to identify toxic language in comments. Works in seven languages: English, Italian, French, Russian, Portuguese, Spanish, and Turkish.
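Detoxify exposes pretrained checkpoints through a small predict API. The sketch below scores a batch of comments with the multilingual checkpoint and drops those above a toxicity cutoff; the sample comments and the 0.5 threshold are illustrative choices, not from the source.

```python
# Score comments for toxicity and keep only the ones below a chosen cutoff.
# Assumes the `detoxify` package (pip install detoxify); the "multilingual"
# checkpoint covers the seven languages listed above.
from detoxify import Detoxify

comments = ["you are wonderful", "you are an idiot"]
scores = Detoxify("multilingual").predict(comments)  # dict: label -> list of scores

kept = [c for c, tox in zip(comments, scores["toxicity"]) if tox < 0.5]
print(kept)  # only the non-toxic comment survives
```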
Dolma's Toolkit
A Python framework for defining Taggers that identify non-language text, perform language ID, and detect PII, toxic text, and “quality” text. Includes reimplementations of heuristics used by Gopher and C4 to remove non-natural language.
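A hedged sketch of defining a custom Tagger is shown below. The registration decorator and module paths follow the Dolma developer docs as we recall them and may differ between versions; the tagger name `char_length_v1` is hypothetical.

```python
# Sketch of a custom Dolma tagger that scores each document by its length.
# Module paths and the registration API are assumptions based on the Dolma
# developer docs and may not match every release.
from dolma.core.data_types import DocResult, Document, Span
from dolma.core.registry import TaggerRegistry
from dolma.core.taggers import BaseTagger

@TaggerRegistry.add("char_length_v1")  # hypothetical tagger name
class CharLengthTagger(BaseTagger):
    def predict(self, doc: Document) -> DocResult:
        # Attach one span covering the whole document, scored by its length.
        span = Span(start=0, end=len(doc.text), type="length", score=float(len(doc.text)))
        return DocResult(doc=doc, spans=[span])
```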
FUN-LangID
Frequently Used N-grams Language ID model, a character 4-gram model trained to recognize up to 1633 languages.
Langdetect
A tool to predict the language of a text, used to filter data in or out based on the desired languages.
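For example, a corpus can be filtered down to a set of desired languages in a few lines; the document list, language set, and helper function below are illustrative.

```python
# Keep only documents detected as one of the desired languages.
# Assumes the `langdetect` package (pip install langdetect).
from langdetect import DetectorFactory, LangDetectException, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; seed for reproducibility

def detected_lang(text):
    try:
        return detect(text)
    except LangDetectException:  # raised when no language features are found
        return None

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt.",
    "1234567890",
]
wanted = {"en", "de"}
kept = [d for d in docs if detected_lang(d) in wanted]
print(kept)
```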
Lilac
A Python package for better understanding your data. Includes keyword and semantic search, as well as detection of PII, duplicates, and language.
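A rough usage sketch follows, assuming the dataset-plus-signals API from the Lilac quickstart; names such as `DatasetConfig`, `PIISignal`, and `LangDetectionSignal` reflect our reading of its docs and may differ across versions, and the dataset choice is illustrative.

```python
# Sketch: load a dataset into a Lilac project and enrich it with PII and
# language-detection signals. API names follow the Lilac quickstart as we
# recall it (pip install lilac) and may vary between releases.
import lilac as ll

ll.set_project_dir("./lilac_project")

config = ll.DatasetConfig(
    namespace="local",
    name="my_corpus",  # hypothetical dataset name
    source=ll.HuggingFaceSource(dataset_name="imdb"),
)
dataset = ll.create_dataset(config)

dataset.compute_signal(ll.PIISignal(), "text")            # flag emails, IPs, etc.
dataset.compute_signal(ll.LangDetectionSignal(), "text")  # per-document language
```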
Roots data cleaning pipeline
A pipeline for processing and improving the quality of crowdsourced datasets.
SpeechBrain's spoken language ID model
A pre-trained spoken language identification model trained on VoxLingua107, a dataset of audio sourced from YouTube covering 107 languages.
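Following the model card on the Hugging Face Hub, the classifier can be loaded and applied roughly as below; the local file name is hypothetical, and newer SpeechBrain releases move `EncoderClassifier` under `speechbrain.inference`.

```python
# Identify the spoken language of an audio clip with SpeechBrain's pretrained
# VoxLingua107 classifier, per its Hugging Face model card.
from speechbrain.pretrained import EncoderClassifier

language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="pretrained_models/lang-id-voxlingua107-ecapa",
)
signal = language_id.load_audio("clip.wav")  # hypothetical local audio file
prediction = language_id.classify_batch(signal)
print(prediction[3])  # predicted language label
```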
The Pile processing scripts
A series of scripts to replicate the Pile dataset. Includes language filtering, profanity filtering, deduplication, and test-set decontamination.