Foundation Model Resources

Welcome to our curated collection of resources for responsible foundation model development! Here, you’ll find tools, artifacts, and papers that guide developers through the decisions involved in building a model. Resources are selected for their usefulness, documentation quality, and awareness within the community.

Data Sources
- Pretraining Data Sources
  Text: 26, Vision: 5, Speech: 12
  Understand the importance of pretraining data for foundation models. Careful data selection impacts model behavior and capabilities.
- Finetuning Data Catalogs
  Text: 10, Vision: 2, Speech: 11
  Discover the breadth of finetuning data sources available for foundation models. From HuggingFace Datasets to specialized catalogs, find resources with strong documentation and diverse datasets.
Data Preparation
Data Documentation and Release
- Data Documentation
  Text: 6, Vision: 6, Speech: 6
  Understand the significance of data documentation for foundation model datasets. Thorough documentation helps users understand intended data usage, legal restrictions, and privacy concerns, though crowdsourced documentation may contain errors.
- Data Governance
  Text: 7, Vision: 4, Speech: 3
  Explore data governance practices for foundation model datasets. Learn about data curation, access control, and mechanisms that let data subjects request removal from hosted datasets, all of which support compliance with privacy and legal requirements.
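As a rough illustration of honoring removal requests, the sketch below filters a hosted dataset against a list of subjects who asked to be removed. The `Record` fields and the `apply_removal_requests` helper are hypothetical names introduced here, not part of any tool listed in the cheatsheet, and a real pipeline would also need identity verification and audit logging.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical dataset row; field names are assumptions for this sketch."""
    record_id: str
    subject_id: str  # identifier of the person the record is about
    text: str


def apply_removal_requests(
    records: Iterable[Record], removed_subject_ids: Iterable[str]
) -> Tuple[List[Record], List[str]]:
    """Drop every record tied to a subject who requested removal.

    Returns the kept records and the IDs of dropped records, so the
    removal can be reported back or logged.
    """
    removed = set(removed_subject_ids)
    kept: List[Record] = []
    dropped_ids: List[str] = []
    for record in records:
        if record.subject_id in removed:
            dropped_ids.append(record.record_id)
        else:
            kept.append(record)
    return kept, dropped_ids


if __name__ == "__main__":
    data = [
        Record("r1", "alice", "first sample"),
        Record("r2", "bob", "second sample"),
        Record("r3", "alice", "third sample"),
    ]
    kept, dropped = apply_removal_requests(data, ["alice"])
    print([r.record_id for r in kept], dropped)
```

Returning the dropped IDs alongside the kept records is a design choice that makes each removal traceable, which matters when a request has to be confirmed to the person who made it.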
Model Training
Environmental Impact
- Environmental Impact
  Text: 10, Vision: 7, Speech: 7
  Explore resources for estimating and mitigating the environmental impact of foundation model development. Learn about tools and methodologies for measuring energy consumption during training or inference and for reducing that impact throughout the model lifecycle.
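To make the idea of estimating energy use concrete, here is a minimal back-of-the-envelope sketch. It assumes a common approximation: energy is average device power times device count times hours, scaled by the data center's power usage effectiveness (PUE), and emissions are that energy times the grid's carbon intensity. The function name and all input values are illustrative assumptions, not figures from the cheatsheet or from any listed tool, which may measure consumption more precisely.

```python
def estimate_training_footprint(
    avg_power_watts: float,
    num_devices: int,
    hours: float,
    pue: float,
    grid_kg_co2e_per_kwh: float,
) -> tuple:
    """Return (energy_kwh, emissions_kg_co2e) for one training or inference run.

    Every input must be measured or looked up by the developer;
    this sketch supplies no defaults.
    """
    if avg_power_watts < 0 or num_devices < 0 or hours < 0:
        raise ValueError("power, device count, and hours must be non-negative")
    if pue < 1.0:
        raise ValueError("PUE is a ratio of facility to IT energy and cannot be below 1")
    if grid_kg_co2e_per_kwh < 0:
        raise ValueError("carbon intensity must be non-negative")

    # Energy drawn by the accelerators themselves, converted from Wh to kWh.
    device_energy_kwh = avg_power_watts * num_devices * hours / 1000.0
    # Scale up to account for cooling and other facility overhead.
    facility_energy_kwh = device_energy_kwh * pue
    # Convert energy to emissions using the local grid's carbon intensity.
    emissions_kg = facility_energy_kwh * grid_kg_co2e_per_kwh
    return facility_energy_kwh, emissions_kg


if __name__ == "__main__":
    # Illustrative numbers only: 8 devices at 300 W for 24 hours.
    energy, emissions = estimate_training_footprint(300.0, 8, 24.0, 1.5, 0.5)
    print(f"{energy:.1f} kWh, {emissions:.1f} kg CO2e")
```

An estimate like this is only as good as its inputs, so measured power draw is preferable to a device's rated maximum whenever it is available.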
Model Evaluation
Model Release and Evaluation

Frequently Asked Questions (FAQ)

The foundation model cheatsheet is a curated collection of tools, datasets, code examples, and papers that guide the development of foundation models (large language models, image generators, etc.). It’s designed for newer developers who want to build and release these models responsibly.

This cheatsheet was created to:
1) Help developers navigate the complex landscape of responsible foundation model development.
2) Provide guidance on mitigating potential misuses or harms of these models.
3) Highlight helpful resources that might not be widely known.

The cheatsheet includes: data catalogs (especially for less common languages), tools for searching and analyzing data, repositories for evaluating models, and papers that summarize key development decisions.

The cheatsheet is a living document! You can contribute new resources by following the instructions given on the website. Your contributions will be reviewed for relevance and quality.

The cheatsheet is primarily aimed at newer foundation model developers. Larger organizations that build commercial AI products have additional factors to consider.

Currently, the cheatsheet is focused on text, vision, and speech models. The creators acknowledge that this is just a starting point.

Foundation Model Resources

Data Sources
- Pretraining Data Sources
- Finetuning Data Catalogs
Data Preparation
- Data Exploration
- Data Cleaning
- Data Deduplication
- Data Decontamination
- Data Auditing
Data Documentation and Release
- Data Documentation
- Data Governance
Model Training
- Pretraining Repositories
- Finetuning Repositories
- Efficiency & Resource Allocation
- Additional Educational Resources
Environmental Impact
- Environmental Impact
Model Evaluation
- Capabilities
- Risks & Harms Taxonomies
- Risks & Harms Evaluation
Model Release and Evaluation
- Model Documentation
- Reproducibility
- License Selection
- Usage Monitoring

Frequently Asked Questions (FAQ)