22 Resources for Model Evaluation Capabilities

Many modern foundation models are released as general-purpose systems, so their use cases are open-ended and poorly specified. This poses significant challenges for evaluation benchmarks, which cannot critically evaluate so many tasks, applications, and risks systematically or fairly. It is important to carefully scope the original intentions for the model, and to match the evaluations to those intentions.
Capabilities
BigBench Hard (Text)
A challenging subset of 23 BigBench tasks on which, at the time of release, language models did not outperform average human annotator performance.
BigCode Evaluation Harness (Text)
A framework for the evaluation of code generation models, compiling many evaluation sets.
HEIM (Text, Vision)
A large suite of text-to-image evaluations. Useful for thorough capability analysis of these model types.
HELM classic (Text)
A large suite of benchmarks and metric types for holistically evaluating many model qualities beyond performance on general tasks. Useful for thorough comparisons against other well-known models.
Hugging Face Leaderboards Collection (Text, Speech, Vision)
A collection of unique leaderboards on Hugging Face for ranking models across modalities and tasks.
HumanEvalPack (Text)
HumanEvalPack is a code evaluation benchmark covering 6 programming languages and 3 tasks, extending OpenAI's HumanEval.
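To get a feel for the task format, HumanEvalPack can be inspected directly from the Hugging Face Hub. The sketch below assumes the dataset id bigcode/humanevalpack, the "python" configuration, and the field names shown; verify them against the current release.

```python
# Minimal sketch: inspecting HumanEvalPack examples from the Hugging Face Hub.
# The dataset id "bigcode/humanevalpack", the "python" config, and the field
# names below are assumptions about the Hub release, not guaranteed by this page.
from datasets import load_dataset

ds = load_dataset("bigcode/humanevalpack", "python", split="test")

example = ds[0]
print(example["prompt"])              # function signature plus docstring to complete
print(example["canonical_solution"])  # reference solution exercised by the unit tests
```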
Lighteval (Text)
A small, highly configurable LLM evaluation library for fast experimentation and iteration.
LM Evaluation Harness (Text)
An orchestration framework for standardized prompted evaluation of language models, supporting hundreds of subtasks.
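For orientation, a minimal programmatic run might look like the sketch below. The simple_evaluate entry point, the "hf" model type, and the task name reflect a recent (v0.4.x) release of lm-evaluation-harness and should be checked against the installed version.

```python
# Minimal sketch of a programmatic run with EleutherAI's lm-evaluation-harness.
# simple_evaluate(), the "hf" model type, and the result layout are assumptions
# about a recent (v0.4.x) release; consult the repository for exact arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],                            # any of the hundreds of supported subtasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])              # per-task accuracy and related metrics
```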
LMSys Chatbot Arena (Text)
A leaderboard of models based on Elo-style ratings, where humans (or a judge model) select their preferred response between two anonymous models. Chatbot Arena, MT-Bench, and 5-shot MMLU are used as benchmarks. This resource provides a general-purpose, though GPT-4-biased, perspective on model capabilities.
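For intuition, the sketch below applies a standard Elo update to a single pairwise preference. The K-factor and 400-point scale are the conventional defaults, not necessarily the exact aggregation Chatbot Arena publishes (its ratings are fit over all battles rather than updated sequentially).

```python
# Minimal sketch of the Elo update behind arena-style pairwise leaderboards.
# K and the 400-point scale are conventional defaults, not Chatbot Arena's
# exact published aggregation.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A's response was preferred, 0.0 if B's, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins the comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```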
MMBench (Text, Vision)
A joint vision and text benchmark evaluating dozens of capabilities, using curated datasets and ChatGPT in the loop.
MME (Text, Vision)
An evaluation benchmark for multimodal large language models, with 14 manually curated subtasks designed to avoid data leakage.
MTEB (Text)
The Massive Text Embedding Benchmark measures the quality of embeddings across 58 datasets and 112 languages for tasks related to retrieval, classification, clustering or semantic similarity.
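A minimal MTEB run over a single task might look like the sketch below, following the package's documented pattern; the model and task names are illustrative choices only.

```python
# Minimal sketch of running one MTEB task on a sentence-embedding model.
# The MTEB(tasks=...) API follows the package's documented usage at the time of
# writing; the model and task names here are illustrative, not prescribed.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```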
OpenASR Leaderboard (Speech)
An automatic leaderboard ranking and evaluating speech recognition models on common benchmarks.
Open LLM Leaderboard (Text)
A popular leaderboard on Hugging Face for ranking open LLMs on their knowledge, reasoning and math capabilities.
SWE-bench (Text)
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
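Task instances can be inspected from the Hugging Face Hub as sketched below. The dataset id princeton-nlp/SWE-bench_Lite and the field names are assumptions about the released data; scoring generated patches additionally requires the execution harness from the SWE-bench repository.

```python
# Minimal sketch: inspecting SWE-bench task instances from the Hugging Face Hub.
# The dataset id "princeton-nlp/SWE-bench_Lite" and the field names below are
# assumptions about the released data; patch scoring is done separately by the
# benchmark's execution harness.
from datasets import load_dataset

swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

task = swe[0]
print(task["repo"])               # GitHub repository the issue comes from
print(task["problem_statement"])  # issue text the model must resolve with a patch
```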
The Edinburgh International Accents of English Corpus (Speech)
A benchmark dataset of diverse English varieties for evaluating automatic speech recognition models, which are typically trained and tested only on US English.
HELM lite (Text)
A lightweight subset of capability-centric benchmarks within HELM with comparisons to many prominent open and closed models.
MMMU (Text, Vision)
A benchmark to evaluate joint text and vision models on 11k examples spanning 30 college-level subject domains.
SIB-200 (Text)
A large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects.