17 Finetuning Data Catalogs for Foundation Models

Finetuning, or adaptation, of foundation models is a complex step in model development. Finetuned models are deployed more frequently than base models. Here we link to some useful and widely used resources for finetuning.

Resources by modality: Text (12), Speech (12), Vision (3), Video (1), Tabular (1)
  • AI4Bhārat Indic NLP

    A repository of Indian language text and speech resources, including datasets.

  • Arabic NLP Data Catalogue

    A catalogue of hundreds of Arabic text and speech finetuning datasets, regularly updated.

    Modalities: Text, Speech
  • CHiME-5

    A speaker diarization dataset comprising over 50 hours of conversational speech recorded at twenty real dinner parties held in real homes.

  • Data Provenance Collection

    A repository and explorer tool for selecting popular finetuning, instruction, and alignment training datasets from Hugging Face, based on data provenance and characteristics criteria.

  • ImageNet

    An image classification dataset with 1.3M samples and 1,000 classes.

  • Indonesian NLP Data Catalogue

    A repository of hundreds of Indonesian language datasets.

    Modalities: Text, Speech
  • Lanfrica

    An online catalogue that links to African language resources (papers and datasets) in both text and speech.

    Modalities: Text, Speech
  • Masakhane NLP

    A repository of African language text and speech resources, including datasets.

  • MS COCO

    A dataset for object detection, segmentation, captioning, and retrieval.

  • OpenSLR

    A collection of user-contributed datasets for various speech processing tasks.

    Modalities: Speech
  • SEACrowd

    A repository of hundreds of Southeast Asian language datasets.

  • VoxCeleb

    A speaker identification dataset comprising YouTube interviews with thousands of celebrities.

  • VoxLingua107

    A spoken language identification dataset built from audio extracted from YouTube videos that were retrieved with language-specific search phrases.

  • Zenodo AfricaNLP Community

    An online catalogue that provides African language resources (data and models) in both text and speech.

    Modalities: Text, Speech
  • Aya Dataset

    A permissively licensed multilingual instruction finetuning dataset curated through the Aya Annotation Platform from Cohere For AI. It contains 204k human-annotated prompt-completion pairs spanning 65 languages, along with demographic data about the annotators.

  • FinetuneDB

    An LLM Ops platform for customizing AI models to deliver personalized experiences at scale. It helps automate the creation of per-user finetuning datasets by transforming provided data into the required format, and its monitoring and evaluation suite is intended to keep each personalized model aligned with the user's goals.

  • Croissant

    Croissant is an open-source metadata format developed by MLCommons to standardise the description of machine learning (ML) datasets, enhancing their discoverability, portability, and interoperability across various tools and platforms. It builds upon the schema.org vocabulary, extending it to encapsulate ML-specific attributes, including dataset structure, resources, and semantics. Croissant is particularly relevant in scenarios requiring consistent dataset documentation to facilitate seamless integration into ML workflows.

    Modalities: Text, Vision, Speech, Video, Tabular
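
To make the Croissant entry above more concrete, the sketch below builds a small JSON-LD description of a one-file CSV dataset: a schema.org dataset record extended with ML-specific structure (files, record sets, fields). It only loosely follows the Croissant vocabulary, is heavily simplified, and has not been validated against the specification. The helper name `make_croissant_sketch`, the dataset name, the column names, and the `example.org` URL are all hypothetical.

```python
import json


def make_croissant_sketch(name, description, file_url, columns):
    """Build a simplified, Croissant-style JSON-LD dict for a single CSV file.

    `columns` maps column names to data types (e.g. "sc:Text").
    This is an illustration only; a real Croissant file carries a fuller
    @context and should be checked against the official specification.
    """
    file_id = "data.csv"  # hypothetical identifier for the one file resource
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resources: where the raw bytes live and how they are encoded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: one record set whose fields are read from CSV columns.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"examples/{column}",
                        "dataType": data_type,
                        "source": {
                            "fileObject": {"@id": file_id},
                            "extract": {"column": column},
                        },
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


# Hypothetical finetuning dataset with a prompt column and a completion column.
metadata = make_croissant_sketch(
    name="toy-instruction-data",
    description="A made-up prompt/completion dataset used for illustration.",
    file_url="https://example.org/data.csv",
    columns={"prompt": "sc:Text", "completion": "sc:Text"},
)
print(json.dumps(metadata, indent=2))
```

Keeping the file description (`distribution`) separate from the logical structure (`recordSet`) is what lets a tool load the same dataset without bespoke parsing code; the official Croissant tooling, rather than a hand-built dict like this one, is the appropriate route for real datasets.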