Data Governance Resources for Foundation Models

Releasing all datasets involved in the development of a foundation model, including training, fine-tuning, and evaluation data, facilitates external scrutiny and supports further research. Proper data governance practices, such as respecting opt-out preference signals, pseudonymizing data, and redacting personally identifiable information (PII), are required at both the curation and release stages. It is also essential to control data access according to research needs and to enable data subjects to request removal from the hosted dataset.
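As a minimal illustration of what a redaction pass can look like, the Python sketch below replaces email addresses and IPv4 addresses with placeholder tokens. The regular expressions and placeholder names are simplified assumptions for demonstration, not a production pipeline; in practice, rule-based filters are combined with learned detectors such as StarPII, listed below.

```python
# Minimal rule-based PII redaction sketch: replaces email addresses and
# IPv4 addresses with placeholder tokens. These patterns are deliberately
# simple illustrations and will miss edge cases by design.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text: str) -> str:
    """Replace matched PII spans with angle-bracket placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return text

print(redact("Reach jane.doe@example.com, server 192.168.0.12"))
# -> Reach <EMAIL>, server <IP_ADDRESS>
```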
Data Governance in the Age of Large-Scale Data-Driven Language Technology (Text, Speech, Vision)
A paper detailing the data governance decisions undertaken during BigScience’s BLOOM project.
Reclaiming the Digital Commons: A Public Data Trust for Training Data (Text, Speech, Vision)
A paper that argues for the creation of a public data trust to provide collective input into the development of AI systems, and that analyzes the feasibility of such a trust.
BigCode Governance Card (Text)
A report outlining governance questions, approaches, and tooling in the BigCode project, with a focus on data governance.
Am I in The Stack? (Text)
A tool that lets software developers check whether their code was included in The Stack dataset and opt out of inclusion in future versions.
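The tool itself is a web interface; as a rough programmatic alternative, the sketch below streams a language subset of The Stack from the Hugging Face Hub and filters on its repository-name metadata. The `data_dir` layout and column names follow the bigcode/the-stack dataset card, the repository name is a hypothetical example, and access to the dataset is gated behind its terms of use.

```python
# Hedged sketch: look for files from a given repository in The Stack by
# streaming one language subset (avoids downloading multi-TB data).
# Requires accepting the dataset terms and authenticating with the Hub.
from datasets import load_dataset

REPO = "octocat/Hello-World"  # hypothetical repository to look for

stream = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # per-language directory, per the dataset card
    split="train",
    streaming=True,
)

for i, example in enumerate(stream):
    if example.get("max_stars_repo_name") == REPO:
        print("Found file from", REPO, ":", example.get("max_stars_repo_path"))
        break
    if i >= 100_000:  # bounded scan for this sketch, not an exhaustive check
        print("Not found in the first 100k examples")
        break
```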
StarPII: BigCode Pseudonymization Model (Text)
A model trained on a new dataset of PII in code, used to pseudonymize datasets prior to training.
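A sketch of how such a model might be applied with the generic transformers token-classification pipeline, assuming the gated bigcode/starpii checkpoint; the sample snippet, label names, and placeholder scheme are illustrative assumptions rather than BigCode's exact pseudonymization pipeline:

```python
# Hedged sketch: detect PII spans in a code snippet with StarPII and
# replace them with placeholder tokens.
from transformers import pipeline

detector = pipeline(
    "token-classification",
    model="bigcode/starpii",        # gated model: requires accepting terms
    aggregation_strategy="simple",  # merge sub-token predictions into spans
)

code = 'contact = "jane.doe@example.com"  # server at 10.0.0.1'

# Replace detected spans from right to left so earlier offsets stay valid.
redacted = code
for ent in sorted(detector(code), key=lambda e: e["start"], reverse=True):
    placeholder = f"<{ent['entity_group']}>"
    redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"] :]

print(redacted)
```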
French DPA Resource Sheets on AI and GDPR (Text, Speech, Vision)
A set of resource sheets focused on GDPR compliance, covering the legal basis for data collection and sharing as well as best practices for handling personal data.