6 Data Decontamination Resources for Foundation Models
Data decontamination is the process of removing evaluation data from the training set, a step that safeguards the integrity of model evaluation. The following resources aid in proactively protecting test data with canaries, decontaminating data before training, and identifying or proving what data a model was trained on.
BigBench Canaries
BigBench’s “Training on the Test Set” task provides guidance on using canaries to check whether a model was trained on an evaluation set.
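To make the canary idea concrete, here is a minimal sketch (not BigBench's actual harness, and with a hypothetical `model_logprob` scoring function you would supply): plant a high-entropy string in the evaluation set, then compare the model's score on it against fresh strings of the same format.

```python
# Minimal canary-check sketch: a high-entropy string should not score
# better than freshly drawn controls unless the model memorized it.
import secrets

def make_canary(prefix: str = "canary-guid") -> str:
    """Generate a high-entropy string that is unlikely to occur naturally."""
    return f"{prefix}-{secrets.token_hex(16)}"

def canary_was_likely_trained_on(model_logprob, canary: str, n_controls: int = 100) -> bool:
    """Flag the canary if the model scores it above every freshly drawn control.

    `model_logprob` is a hypothetical user-supplied function mapping a string
    to its log-probability under the model.
    """
    canary_score = model_logprob(canary)
    controls = [model_logprob(make_canary()) for _ in range(n_controls)]
    # A canary scoring higher than all controls is strong evidence of memorization.
    return canary_score > max(controls)
```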
Carper AI Decontamination Tool
A repository, heavily based on the BigCode repository, for removing evaluation set examples from a text training set.
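For illustration, here is a minimal n-gram overlap filter in the spirit of such tools; the actual repository may match differently (e.g., substrings or MinHash signatures), so treat this as a sketch of the general technique.

```python
# Sketch of n-gram overlap decontamination: drop any training document
# that shares a word n-gram with the evaluation set.
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams, a common unit for contamination matching."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs: list, eval_docs: list, n: int = 13) -> list:
    """Return the training documents with no n-gram overlap against the eval set."""
    eval_grams = set().union(*(ngrams(doc, n) for doc in eval_docs))
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(eval_grams)]
```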
Data Portraits
A tool to test membership in popular datasets, like The Pile or The Stack, i.e., whether a model has seen certain data.
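As a sketch of the underlying idea, the snippet below records hashed character n-grams of a corpus and reports what fraction of a document's n-grams appear in it; the real tool uses a space-efficient Bloom filter rather than the plain set used here.

```python
# Membership-testing sketch: hash corpus n-grams into a compact structure,
# then measure how much of a query document is covered by it.
import hashlib

def char_ngrams(text: str, n: int = 50) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_portrait(corpus: list, n: int = 50) -> set:
    """Hash every character n-gram in the corpus into a membership set."""
    return {hashlib.sha1(g.encode()).digest()[:8]
            for doc in corpus for g in char_ngrams(doc, n)}

def overlap_fraction(portrait: set, text: str, n: int = 50) -> float:
    """Fraction of a document's n-grams found in the portrait.

    High values suggest the document was in the recorded dataset.
    """
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(hashlib.sha1(g.encode()).digest()[:8] in portrait for g in grams)
    return hits / len(grams)
```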
Interpreting Canary Exposure
An explanation of how to interpret canary exposure, including by relating it to membership inference attacks and differential privacy.
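For reference, canary exposure (from Carlini et al.'s "secret sharer" work, which this explainer builds on) is exposure = log2|R| - log2(rank), where rank is the canary's perplexity rank among a candidate set R. A minimal sketch, assuming a hypothetical user-supplied `perplexity` function:

```python
# Exposure metric sketch: how unusually likely is the canary among candidates?
import math

def exposure(perplexity, canary: str, candidates: list) -> float:
    """Higher exposure means the canary ranks unusually likely, i.e. memorized.

    `perplexity` is a hypothetical function mapping a string to its
    perplexity under the model; `candidates` plays the role of R.
    """
    scores = [perplexity(c) for c in candidates]
    canary_ppl = perplexity(canary)
    # Rank 1 = lowest perplexity among all candidates (strongest memorization signal).
    rank = 1 + sum(1 for s in scores if s < canary_ppl)
    return math.log2(len(candidates)) - math.log2(rank)
```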
Proving Test Set Contamination in Black Box Language Models
A paper that provides methods for detecting test set contamination in language models, with provable guarantees, without access to pretraining data or model weights.
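The paper's core test rests on exchangeability: if the model never saw the test set, no ordering of its examples should be systematically more likely than a random shuffle. A simplified permutation-test sketch of that idea, assuming a hypothetical `seq_logprob` function that scores a concatenated list of examples under the model:

```python
# Exchangeability-test sketch: compare the canonical ordering's likelihood
# against random shuffles to get a p-value for contamination.
import random

def contamination_p_value(seq_logprob, examples: list, n_shuffles: int = 999) -> float:
    """Permutation-test p-value; small values are evidence of contamination.

    `seq_logprob` is a hypothetical function scoring an ordered list of
    examples as one sequence under the model.
    """
    canonical = seq_logprob(examples)
    beaten = 0
    for _ in range(n_shuffles):
        shuffled = examples[:]  # score a random ordering of the same examples
        random.shuffle(shuffled)
        if seq_logprob(shuffled) >= canonical:
            beaten += 1
    return (beaten + 1) / (n_shuffles + 1)
```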