Finetuning or adaptation of foundation models is a complex step in model development, and finetuned models are deployed more frequently than base models. Here, we link to some useful and widely-used resources for finetuning.

A repository of Indian language text and speech resources, including datasets.
A catalogue of hundreds of Arabic text and speech finetuning datasets, regularly updated.
A speaker diarization dataset comprising over 50 hours of conversational speech recordings collected from twenty real dinner parties held in real homes.
A repository and explorer tool for selecting popular finetuning, instruction, and alignment training datasets from Hugging Face, based on their data provenance and other characteristics.
An online catalogue that provides links to African language resources (papers and datasets) in both text and speech.
A repository of African language text and speech resources, including datasets.
A speaker identification dataset comprising YouTube interviews with thousands of celebrities.
A spoken language identification dataset created from audio extracted from YouTube videos retrieved with language-specific search phrases.
An online catalogue that provides African language resources (data and models) in both text and speech.
A permissively licensed multilingual instruction finetuning dataset curated by the Aya Annotation Platform from Cohere For AI. The dataset contains a total of 204k human-annotated prompt-completion pairs along with the demographic data of the annotators, spanning 65 languages.
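As a sketch of how such an instruction dataset might be pulled into a finetuning pipeline, the snippet below loads it with the Hugging Face datasets library. The repository ID "CohereForAI/aya_dataset" and the column names used here are assumptions, not confirmed by this catalogue entry; check the dataset card on the Hugging Face Hub for the actual schema.

```python
# Sketch: loading the Aya instruction dataset with the Hugging Face `datasets` library.
# The repo ID "CohereForAI/aya_dataset" and the column names below are assumptions;
# verify them against the dataset card before use.
from datasets import load_dataset

dataset = load_dataset("CohereForAI/aya_dataset", split="train")

# Inspect one prompt-completion pair (column names assumed).
example = dataset[0]
print(example.get("inputs"), "->", example.get("targets"))

# Filter to a single language for targeted finetuning (language column assumed).
swahili = dataset.filter(lambda row: row.get("language") == "Swahili")
print(len(swahili), "Swahili examples")
```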
FinetuneDB is an LLM Ops platform for customizing AI models to deliver personalized experiences at scale. It automates the creation of fine-tuning datasets on a per-user basis by transforming provided data into the required format, and its monitoring and evaluation suite helps keep each personalized model aligned with its goals.
Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.
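As a hedged illustration of how a Croissant description might be consumed in an ML workflow, the sketch below uses the mlcroissant Python package to read a JSON-LD dataset description and iterate over its records. The metadata URL and the record-set name are placeholders, and the exact API surface should be checked against the MLCommons Croissant documentation.

```python
# Sketch: reading a Croissant (JSON-LD) dataset description with the mlcroissant package.
# The metadata URL and the record-set name "default" are placeholders; the call pattern
# follows published mlcroissant examples but should be verified against current docs.
import mlcroissant as mlc

# Load the dataset description from a Croissant JSON-LD file (hypothetical URL).
dataset = mlc.Dataset(jsonld="https://example.org/datasets/my-dataset/croissant.json")

# The metadata object exposes ML-specific attributes (name, record sets, fields).
print(dataset.metadata.name)

# Iterate over records of one record set declared in the Croissant file.
for i, record in enumerate(dataset.records(record_set="default")):
    print(record)
    if i >= 2:
        break
```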