9 Data Governance Resources for Foundation Models

Data Governance

Modalities: Text (9), Speech (5), Vision (6), Video (1), Tabular (1)

HaveIBeenTrained
A combined search and opt-out tool for the LAION datasets.
- Website
Text Vision
Data Governance in the Age of Large-Scale Data-Driven Language Technology
A paper detailing the data governance decisions undertaken during BigScience’s BLOOM project.
Text Speech Vision
Reclaiming the Digital Commons: A Public Data Trust for Training Data
A paper that argues for the creation of a public data trust for collective input into the creation of AI systems and analyzes the feasibility of such a data trust.
- Download Paper
Text Speech Vision
BigCode Governance Card
A report outlining governance questions, approaches, and tooling in the BigCode project, with a focus on data governance.
- Download Paper
Text
AmIinTheStack
A tool that lets software developers check whether their code was included in The Stack dataset and opt out of inclusion in future versions.
- Hugging Face
Text
StarPII: BigCode Pseudonymization Model
A model, trained on a new dataset of PII in code, that is used to pseudonymize a dataset prior to training.
- Hugging Face
Text
French DPA Resource sheets on AI and GDPR
A set of resource sheets focused on GDPR compliance, covering the legal basis for data collection, data sharing, and best practices for handling personal data.
- Download Paper
Text Speech Vision
AI Verify
An AI governance testing framework and software toolkit that helps industries be more transparent about their AI in order to build trust. The AI Verify Foundation is fostering a community to contribute to the use and development of AI testing frameworks, code base, standards, and best practices.
Text Vision Speech
Croissant
Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.
Text Vision Speech Video Tabular
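
To make the Croissant entry more concrete, here is a minimal sketch of what a Croissant-style dataset description might look like when assembled as JSON-LD in Python. It uses only the standard library, not the official tooling. The helper `describe_dataset`, the dataset name, the URLs, and the column names are all hypothetical, and the `@context` is heavily simplified; the key names reflect our reading of the Croissant format and should be checked against the MLCommons specification before use.

```python
import json


def describe_dataset(name, description, license_url, file_url, columns):
    """Build a simplified, Croissant-style JSON-LD description as a dict.

    Hypothetical helper for illustration only; a real Croissant file
    uses the fuller @context defined in the specification.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            # Map ML-specific keys onto the Croissant namespace.
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "dataType": "cr:dataType",
        },
        # Core schema.org dataset attributes.
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: one record set with one field per column.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "records",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"records/{column}",
                        "name": column,
                        "dataType": data_type,
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


# Placeholder values, not a real dataset.
metadata = describe_dataset(
    name="toy-reviews",
    description="A toy dataset of product reviews.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    file_url="https://example.org/toy-reviews/data.csv",
    columns={"text": "sc:Text", "rating": "sc:Integer"},
)

print(json.dumps(metadata, indent=2))
```

The three groups of keys mirror the attributes named in the description above: schema.org fields for general dataset metadata, `distribution` for resources, and `recordSet` for structure and semantics.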
