6 Data Decontamination Resources for Foundation Models
Data decontamination is the process of removing evaluation data from the training set, a step that safeguards the integrity of model evaluation. The following resources aid in proactively protecting test data with canaries, decontaminating data before training, and identifying or proving what data a model was trained on.
BigBench Canaries
BigBench’s “Training on the Test Set” task provides guidance on using canaries to check whether a model was trained on an evaluation set.
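To make the canary idea concrete, here is a minimal sketch (not BigBench's actual harness, and with a hypothetical `model_logprob` scoring function you would supply): plant a high-entropy string in the evaluation set, then compare the model's score on it against fresh strings of the same format.

```python
# Minimal canary-check sketch: a high-entropy string should not score
# better than freshly drawn controls unless the model memorized it.
import secrets

def make_canary(prefix: str = "canary-guid") -> str:
    """Generate a high-entropy string that is unlikely to occur naturally."""
    return f"{prefix}-{secrets.token_hex(16)}"

def canary_was_likely_trained_on(model_logprob, canary: str, n_controls: int = 100) -> bool:
    """Flag the canary if the model scores it above every freshly drawn control.

    `model_logprob` is a hypothetical user-supplied function mapping a string
    to its log-probability under the model.
    """
    canary_score = model_logprob(canary)
    controls = [model_logprob(make_canary()) for _ in range(n_controls)]
    # A canary scoring higher than all controls is strong evidence of memorization.
    return canary_score > max(controls)
```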
Carper AI Decontamination Tool
A repository, heavily based on the BigCode repository, for removing evaluation set examples from a text training set.
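For illustration, here is a minimal n-gram overlap filter in the spirit of such tools; the actual repository may match differently (e.g., substrings or MinHash signatures), so treat this as a sketch of the general technique.

```python
# Sketch of n-gram overlap decontamination: drop any training document
# that shares a word n-gram with the evaluation set.
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams, a common unit for contamination matching."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs: list, eval_docs: list, n: int = 13) -> list:
    """Return the training documents with no n-gram overlap against the eval set."""
    eval_grams = set().union(*(ngrams(doc, n) for doc in eval_docs))
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(eval_grams)]
```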
Data Portraits
A tool to test membership in popular datasets, like The Pile or The Stack, i.e., whether a model has seen certain data.
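As a sketch of the underlying idea, the snippet below records hashed character n-grams of a corpus and reports what fraction of a document's n-grams appear in it; the real tool uses a space-efficient Bloom filter rather than the plain set used here.

```python
# Membership-testing sketch: hash corpus n-grams into a compact structure,
# then measure how much of a query document is covered by it.
import hashlib

def char_ngrams(text: str, n: int = 50) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_portrait(corpus: list, n: int = 50) -> set:
    """Hash every character n-gram in the corpus into a membership set."""
    return {hashlib.sha1(g.encode()).digest()[:8]
            for doc in corpus for g in char_ngrams(doc, n)}

def overlap_fraction(portrait: set, text: str, n: int = 50) -> float:
    """Fraction of a document's n-grams found in the portrait.

    High values suggest the document was in the recorded dataset.
    """
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    hits = sum(hashlib.sha1(g.encode()).digest()[:8] in portrait for g in grams)
    return hits / len(grams)
```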
Interpreting Canary Exposure
An explanation of how to interpret canary exposure, including by relating it to membership inference attacks and differential privacy.
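For reference, canary exposure (from Carlini et al.'s "secret sharer" work, which this explainer builds on) is exposure = log2|R| - log2(rank), where rank is the canary's perplexity rank among a candidate set R. A minimal sketch, assuming a hypothetical user-supplied `perplexity` function:

```python
# Exposure metric sketch: how unusually likely is the canary among candidates?
import math

def exposure(perplexity, canary: str, candidates: list) -> float:
    """Higher exposure means the canary ranks unusually likely, i.e. memorized.

    `perplexity` is a hypothetical function mapping a string to its
    perplexity under the model; `candidates` plays the role of R.
    """
    scores = [perplexity(c) for c in candidates]
    canary_ppl = perplexity(canary)
    # Rank 1 = lowest perplexity among all candidates (strongest memorization signal).
    rank = 1 + sum(1 for s in scores if s < canary_ppl)
    return math.log2(len(candidates)) - math.log2(rank)
```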
Proving Test Set Contamination in Black Box Language Models
A paper that provides methods for detecting test set contamination in language models, with provable guarantees, without access to pretraining data or model weights.
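The paper's core test rests on exchangeability: if the model never saw the test set, no ordering of its examples should be systematically more likely than a random shuffle. A simplified permutation-test sketch of that idea, assuming a hypothetical `seq_logprob` function that scores a concatenated list of examples under the model:

```python
# Exchangeability-test sketch: compare the canonical ordering's likelihood
# against random shuffles to get a p-value for contamination.
import random

def contamination_p_value(seq_logprob, examples: list, n_shuffles: int = 999) -> float:
    """Permutation-test p-value; small values are evidence of contamination.

    `seq_logprob` is a hypothetical function scoring an ordered list of
    examples as one sequence under the model.
    """
    canonical = seq_logprob(examples)
    beaten = 0
    for _ in range(n_shuffles):
        shuffled = examples[:]  # score a random ordering of the same examples
        random.shuffle(shuffled)
        if seq_logprob(shuffled) >= canonical:
            beaten += 1
    return (beaten + 1) / (n_shuffles + 1)
```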