Finetuning, or adaptation, of foundation models is a complex step in model development. Finetuned models are deployed more often than base models. Here, we link to some useful and widely used resources for finetuning.
15 Finetuning Data Catalogs for Foundation Models
Finetuning Data Catalogs
AI4Bhārat Indic NLP
A repository of Indian language text and speech resources, including datasets.
Text Speech
Arabic NLP Data Catalogue
A catalogue of hundreds of Arabic text and speech finetuning datasets, regularly updated.
Text Speech
CHiME-5
A speaker diarization dataset comprising over 50 hours of conversational speech recorded at twenty dinner parties held in real homes.
Speech
Data Provenance Collection
A repository and explorer tool for selecting popular finetuning, instruction, and alignment training datasets from Hugging Face, based on data provenance and characteristics criteria.
Text
Lanfrica
An online catalogue that links to African language resources (papers and datasets) in both text and speech.
Text Speech
Masakhane NLP
A repository of African language text and speech resources, including datasets.
Text Speech
VoxCeleb
A speaker identification dataset comprising YouTube interviews with thousands of celebrities.
Speech
VoxLingua107
A spoken language identification dataset built from audio extracted from YouTube videos that were retrieved using language-specific search phrases.
Speech
Zenodo AfricaNLP Community
An online catalogue that provides African language resources (data and models) in both text and speech.
Text Speech
Aya Dataset
A permissively licensed multilingual instruction-finetuning dataset curated by the Aya Annotation Platform from Cohere For AI. It contains 204k human-annotated prompt-completion pairs spanning 65 languages, along with demographic data about the annotators.
Text