9 Data Governance Resources for Foundation Models

Releasing all datasets involved in the development of a Foundation Model, including training, fine-tuning, and evaluation data, can facilitate external scrutiny and support further research. Proper data governance practices, including respecting opt-out preference signals, pseudonymization, or PII redaction, are required at the curation and release stages. Data access control based on research needs and enabling data subjects to request removal from the hosted dataset are essential.
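
As a rough illustration of two of the practices named above (honoring opt-out signals and redacting PII at curation time), the following Python sketch drops records from opted-out domains and masks email addresses. The `Record` type, the `opted_out_domains` set, the `<EMAIL>` placeholder, and the email regex are all illustrative assumptions rather than part of any listed tool; a production pipeline would more likely rely on dedicated resources such as those listed below.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


# Hypothetical record type; the field names are illustrative only.
@dataclass
class Record:
    source_url: str
    text: str


# Deliberately simple email pattern (an assumption); real PII detection
# usually needs a trained model or a much broader rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(records, opted_out_domains):
    """Drop records whose source domain opted out, then redact the rest."""
    kept = []
    for rec in records:
        domain = urlparse(rec.source_url).netloc.lower()
        if domain in opted_out_domains:
            continue  # respect the opt-out signal: exclude the record entirely
        kept.append(Record(rec.source_url, redact_emails(rec.text)))
    return kept


records = [
    Record("https://example.org/a", "Contact alice@example.org for access."),
    Record("https://optout.example/b", "This site asked not to be included."),
]
curated = curate(records, opted_out_domains={"optout.example"})
```

Filtering before redaction keeps opted-out content from being processed at all; a removal request received after release could be handled by adding the source to the same set and re-running the curation step.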

Data Governance

Resources by modality: Text 9, Speech 5, Vision 6, Video 1, Tabular 1
  • HaveIBeenTrained (Webpage)

    A combined search and opt-out tool for the LAION datasets.

    Text Vision
  • Data Governance in the Age of Large-Scale Data-Driven Language Technology (Paper)

    A paper detailing the data governance decisions made during BigScience’s BLOOM project.

  • Reclaiming the Digital Commons: A Public Data Trust for Training Data (Paper)

    A paper that argues for the creation of a public data trust for collective input into the creation of AI systems and analyzes the feasibility of such a data trust.

    Text Speech Vision
  • BigCode Governance Card (Paper)

    A report outlining governance questions, approaches, and tooling in the BigCode project, with a focus on data governance.

  • AmIinTheStack (Hugging Face)

    A tool that lets software developers check whether their code was included in The Stack dataset and opt out of inclusion in future versions.

  • StarPII: BigCode Pseudonymization Model (Hugging Face)

    A model trained on a new dataset of PII in code, used to pseudonymize a dataset prior to training.

  • French DPA Resource sheets on AI and GDPR (Webpage)

    A set of resource sheets focused on GDPR compliance, covering the legal basis for data collection and sharing as well as best practices for handling personal data.

    Text Speech Vision
  • AI Verify (GitHub)

    An AI governance testing framework and software toolkit that helps industries be more transparent about their AI in order to build trust. The AI Verify Foundation is fostering a community to contribute to the use and development of AI testing frameworks, code bases, standards, and best practices.

    Text Vision Speech
  • Croissant (Website)

    Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.

    Text Vision Speech Video Tabular
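
To make the Croissant entry above more concrete, here is a minimal Python sketch that assembles a Croissant-style JSON-LD description of a single CSV file using only the standard library. It is abridged and not validated against the Croissant specification: the `@context` is shortened, the property names (`distribution`, `recordSet`, `field`, `source`) reflect one reading of the format, and the `build_metadata` helper, dataset name, and URLs are hypothetical. The official specification should be treated as the authority on the exact vocabulary.

```python
import json


def build_metadata(name, description, license_url, file_url):
    """Assemble a small Croissant-style JSON-LD document (abridged sketch)."""
    return {
        # Shortened context (assumption): the real Croissant context
        # declares many more terms than shown here.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",  # schema.org Dataset is the base type
        "name": name,
        "description": description,
        "license": license_url,
        # Resources: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "name": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and their fields map onto those files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "name": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "name": "text",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": "text"},
                        },
                    }
                ],
            }
        ],
    }


# Hypothetical dataset details, used only for illustration.
metadata = build_metadata(
    name="toy-corpus",
    description="A toy text corpus used to illustrate dataset metadata.",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    file_url="https://example.org/toy-corpus/data.csv",
)
serialized = json.dumps(metadata, indent=2)
```

Keeping the license, file locations, and field definitions in one machine-readable document is what lets downstream tools discover and load a dataset without custom glue code.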