17 Finetuning Data Catalogs for Foundation Models

Finetuning Data Catalogs


Text 12 Speech 12 Vision 3 Video 1 Tabular 1

AI4Bhārat Indic NLP
A repository of Indian language text and speech resources, including datasets.
Text Speech
Arabic NLP Data Catalogue
A catalogue of hundreds of Arabic text and speech finetuning datasets, regularly updated.
Text Speech
CHiME-5
A speaker diarization dataset of over 50 hours of conversational speech recorded at twenty dinner parties held in real homes.
Speech
Data Provenance Collection
A repository and explorer tool for selecting popular finetuning, instruction, and alignment training datasets from Hugging Face, based on data provenance and characteristics criteria.
Text
ImageNet
An image classification dataset with 1.3M samples and 1000 classes
Vision
Indonesian NLP Data Catalogue
A repository of hundreds of Indonesian language datasets.
Text Speech
Lanfrica
An online catalogue that links to African language resources (papers and datasets) in both text and speech.
Text Speech
Masakhane NLP
A repository of African language text and speech resources, including datasets.
Text Speech
MS COCO
Object detection, segmentation, captioning and retrieval dataset
Text Vision
OpenSLR
A collection of user-contributed datasets for various speech processing tasks
Speech
SEACrowd
A repository of hundreds of South East Asian language datasets.
Text Speech
VoxCeleb
A speaker identification dataset of YouTube interview clips from thousands of celebrities.
Speech
VoxLingua107
Spoken language identification dataset created using audio extracted from YouTube videos retrieved using language-specific search phrases
Speech
Zenodo AfricaNLP Community
An online catalogue that provides African language resources (data and models) in both text and speech.
Text Speech
Aya Dataset
A permissively licensed multilingual instruction finetuning dataset curated by the Aya Annotation Platform from Cohere For AI. The dataset contains a total of 204k human-annotated prompt-completion pairs along with the demographics data of the annotators, spanning 65 languages.
Text
FinetuneDB
FinetuneDB is an LLM Ops platform for customizing AI models to deliver personalized experiences at scale. It automates the creation of per-user fine-tuning datasets by transforming provided data into the required format, and its monitoring and evaluation suite is intended to keep each personalized model aligned with the user's goals.
Text
Croissant
Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.
Text Vision Speech Video Tabular
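
To make the Croissant entry above more concrete, the following is a minimal sketch of what a Croissant-style JSON-LD description of a single CSV file might look like, built as a plain Python dictionary. It is a simplified illustration, not a validated Croissant document: the helper name `build_croissant_sketch`, the dataset name, the URL, and the column names are all hypothetical, and a real file would need further properties required by the MLCommons specification (for example license and conformance information).

```python
import json


def build_croissant_sketch(name, description, csv_url, columns):
    """Return a simplified, Croissant-style JSON-LD dict for one CSV file.

    `columns` maps column names to schema.org data types (e.g. "sc:Text").
    This is an illustrative sketch and is not checked against the spec.
    """
    file_id = "data.csv"  # hypothetical identifier for the file resource
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resources: the raw files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and their fields map onto the files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "name": "records",
                "field": [
                    {
                        "@type": "cr:Field",
                        "name": col,
                        "dataType": dtype,
                        "source": {
                            "fileObject": {"@id": file_id},
                            "extract": {"column": col},
                        },
                    }
                    for col, dtype in columns.items()
                ],
            }
        ],
    }


# Hypothetical example values, used only to show the resulting structure.
metadata = build_croissant_sketch(
    name="example-instructions",
    description="A toy prompt-completion dataset.",
    csv_url="https://example.org/data.csv",
    columns={"prompt": "sc:Text", "completion": "sc:Text"},
)
print(json.dumps(metadata, indent=2))
```

The three layers in the dictionary (dataset-level metadata, file resources under `distribution`, and record structure under `recordSet`) correspond to the "dataset structure, resources, and semantics" that the entry mentions.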