Welcome to our curated collection of resources for responsible foundation model development! Here, you’ll find a diverse array of tools, artifacts, and insightful papers aimed at guiding developers in navigating the complexities of model development. Our selection criteria emphasize the usefulness, documentation quality, and community awareness of each resource.
Foundation Model Resources
Data Sources
Pretraining Data Sources
Text 26, Vision 5, Speech 12
Understand the importance of pretraining data for foundation models. Careful data selection impacts model behavior and capabilities.
Finetuning Data Catalogs
Text 10, Vision 2, Speech 11
Discover the breadth of finetuning data sources available for foundation models. From HuggingFace Datasets to specialized catalogs, find resources with strong documentation and diverse datasets.
Data Preparation
Data Exploration
Text 10, Vision 2, Speech 1
Learn how to explore and analyze training datasets effectively for foundation models. Understand the nuances of data distributions, topics, and formats to better train your model.
Data Cleaning
Text 16, Vision 2, Speech 1
Master data cleaning, filtering, and mixing techniques for foundation model datasets.
Data Deduplication
Text 5, Vision 2, Speech 1
Learn about data deduplication, a crucial preprocessing step for foundation model datasets. Discover how removing duplicates enhances model training efficiency and reduces the risk of memorizing undesirable information.
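As a rough illustration of the idea, here is a minimal sketch of exact-match deduplication in Python. It is not taken from any listed resource: the `normalize` and `deduplicate` helpers are hypothetical names, and the normalization (lowercasing, collapsing whitespace) is an assumption. Production pipelines often add near-duplicate detection on top of this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    corpus = ["The cat sat.", "the  cat sat.", "A different sentence."]
    print(deduplicate(corpus))
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total length.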
Data Decontamination
Text 6, Vision 0, Speech 0
Explore data decontamination techniques for foundation model training datasets. Learn how to protect test data integrity and ensure reliable model evaluation with canaries and proactive decontamination methods.
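The two ideas named here, canaries and proactive decontamination, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the method of any listed resource: the canary value, the helper names, and the word-level n-gram overlap rule with `n=5` are all hypothetical choices.

```python
# Hypothetical canary: a unique marker that benchmark authors embed in test files.
CANARY = "EXAMPLE-CANARY-00000000"


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_docs: list[str], test_examples: list[str], n: int = 5) -> list[str]:
    """Drop training docs that contain the canary or share any n-gram with the test set."""
    test_grams: set[tuple[str, ...]] = set()
    for example in test_examples:
        test_grams |= ngrams(example, n)
    kept = []
    for doc in train_docs:
        if CANARY in doc:
            continue  # the document was copied from a marked benchmark file
        if ngrams(doc, n) & test_grams:
            continue  # the document overlaps with a test example
        kept.append(doc)
    return kept


if __name__ == "__main__":
    test_set = ["what is the capital of france and why"]
    train_set = [
        "trivia: what is the capital of france and why is it famous",
        "an unrelated document about cooking pasta at home tonight",
        "benchmark dump EXAMPLE-CANARY-00000000 do not train on this",
    ]
    print(decontaminate(train_set, test_set))
```

A smaller `n` removes more training data and catches looser overlaps; a larger `n` is more conservative. The right value depends on the benchmark and the corpus.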
Data Auditing
Text 9, Vision 5, Speech 1
Discover the importance of auditing datasets in foundation model development. Learn how systematic studies and exploration tools can ensure dataset integrity and effectiveness.
Data Documentation and Release
Data Documentation
Text 6, Vision 6, Speech 6
Understand the significance of data documentation for foundation model datasets. Thorough documentation helps users understand intended data usage, legal restrictions, and privacy concerns, although crowdsourced documentation may contain errors.
Data Governance
Text 7, Vision 4, Speech 3
Explore data governance practices for foundation model datasets. Learn about data curation, access control, and enabling data subjects to request removal from hosted datasets to ensure compliance with privacy and legal requirements.
Model Training
Pretraining Repositories
Text 7, Vision 3, Speech 2
Discover pretraining repositories for foundation model development. Explore existing open-source codebases tailored for pretraining to optimize computational resources and enhance accessibility for new practitioners.
Finetuning Repositories
Text 12, Vision 9, Speech 3
Explore finetuning repositories for foundation model development. Access resources for adapting foundation models after pretraining to ensure greater ecosystem compatibility and reduce barriers to experimentation.
Efficiency & Resource Allocation
Text 5, Vision 2, Speech 2
Learn about efficiency and resource allocation in foundation model training. Explore resources and best practices for optimizing resource usage, reducing training costs, and minimizing the environmental impact of model training.
Additional Educational Resources
Text 6, Vision 2, Speech 2
Discover educational resources for foundation model training. Access materials to learn about the considerations and best practices for successfully training or finetuning foundation models.
Environmental Impact
Model Evaluation
Capabilities
Text 20, Vision 8, Speech 3
Explore capability evaluations for foundation models. Understand the challenges in evaluating open-ended use cases and discover benchmarks and methodologies for assessing model performance in diverse tasks and applications.
Risks & Harms Taxonomies
Text 25, Vision 24, Speech 25
Discover taxonomies for evaluating risks and harms in foundation models. Learn about categorizing and understanding risks and hazards associated with AI systems, including issues related to hate speech, cybersecurity, and misuse of AI capabilities.
Risks & Harms Evaluation
Text 20, Vision 6, Speech 5
Explore evaluations of risks and harms in foundation models. Understand why these assessments matter, and discover methodologies and taxonomies for evaluating potential risks and mitigations and for informing decisions in model development and deployment.
Model Release and Evaluation
Model Documentation
Text 5, Vision 5, Speech 5
Learn about model documentation for foundation models. Discover standards and tools for effectively documenting models, including specifications for model usage, recommended use cases, potential risks, and decisions made during training.
Reproducibility
Text 7, Vision 6, Speech 6
Understand the importance of reproducibility in foundation model development. Learn about the challenges of replicating evaluation results and discover best practices for ensuring scientific reproducibility through clear code, documentation, and setup.
License Selection
Text 17, Vision 16, Speech 16
Explore license selection considerations for foundation models. Learn about different types of licenses and their implications for model distribution, use, and adaptation. Discover resources and examples to help guide developers in selecting appropriate licenses for their models.
Usage Monitoring
Text 8, Vision 8, Speech 7
Discover resources for usage monitoring in foundation model development. Explore techniques for monitoring model usage, including watermarking, access control, and reporting adverse events. Learn about challenges and considerations in implementing usage monitoring strategies.
Frequently Asked Questions (FAQ)
The foundation model cheatsheet is a curated collection of tools, datasets, code examples, and papers that guide the development of foundation models (large language models, image generators, etc.). It’s designed for newer developers who want to build and release these models responsibly.
This cheatsheet was created to: 1) Help developers navigate the complex landscape of responsible foundation model development. 2) Provide guidance on mitigating potential misuses or harms of these models. 3) Highlight helpful resources that might not be widely known.
The cheatsheet includes: data catalogs (especially for less common languages), tools for searching and analyzing data, repositories for evaluating models, and papers that summarize key development decisions.
The cheatsheet is a living document! You can contribute new resources by following the instructions given on the website. Your contributions will be reviewed for relevance and quality.
The cheatsheet is primarily aimed at newer foundation model developers. Larger organizations that build commercial AI products have additional factors to consider.
Currently, the cheatsheet focuses on text, vision, and speech models. The creators acknowledge that this is just a starting point.