22 Resources for Model Evaluation Capabilities

Many modern foundation models are released as general-purpose systems, so their use cases are open-ended and poorly specified. This poses significant challenges for evaluation benchmarks, which cannot critically evaluate so many tasks, applications, and risks systematically or fairly. It is important to carefully scope the original intentions for the model, and to match the evaluations to those intentions.
Capabilities
BigBench Hard (Text)
A challenging subset of 23 BigBench tasks on which, at the time of release, language models did not outperform average human annotator performance.
BigCode Evaluation Harness (Text)
A framework for the evaluation of code generation models, compiling many evaluation sets.
HEIM (Text, Vision)
A large suite of text-to-image evaluations. Useful for thorough capability analysis of these model types.
HELM classic (Text)
A large suite of benchmarks and metric types for holistically evaluating many model qualities beyond performance on general tasks. Useful for thorough comparisons against other well-known models.
Hugging Face Leaderboards Collection (Text, Speech, Vision)
A collection of unique leaderboards on Hugging Face for ranking models across modalities and tasks.
HumanEvalPack (Text)
HumanEvalPack is a code evaluation benchmark covering 6 programming languages and 3 tasks, extending OpenAI's HumanEval.
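To get a feel for the task format, HumanEvalPack can be inspected directly from the Hugging Face Hub. The sketch below assumes the dataset id bigcode/humanevalpack, the "python" configuration, and the field names shown; verify them against the current release.

```python
# Minimal sketch: inspecting HumanEvalPack examples from the Hugging Face Hub.
# The dataset id "bigcode/humanevalpack", the "python" config, and the field
# names below are assumptions about the Hub release, not guaranteed by this page.
from datasets import load_dataset

ds = load_dataset("bigcode/humanevalpack", "python", split="test")

example = ds[0]
print(example["prompt"])              # function signature plus docstring to complete
print(example["canonical_solution"])  # reference solution exercised by the unit tests
```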
Lighteval (Text)
A small, highly configurable LLM evaluation library for fast experimentation and iteration.
LM Evaluation Harness (Text)
An orchestration framework for standardized prompted evaluation of language models, supporting hundreds of subtasks.
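For orientation, a minimal programmatic run might look like the sketch below. The simple_evaluate entry point, the "hf" model type, and the task name reflect a recent (v0.4.x) release of lm-evaluation-harness and should be checked against the installed version.

```python
# Minimal sketch of a programmatic run with EleutherAI's lm-evaluation-harness.
# simple_evaluate(), the "hf" model type, and the result layout are assumptions
# about a recent (v0.4.x) release; consult the repository for exact arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],                            # any of the hundreds of supported subtasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])              # per-task accuracy and related metrics
```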
LMSys Chatbot Arena (Text)
A leaderboard of models based on Elo-style ratings, where humans (or a judge model) select their preferred response between two anonymous models. Chatbot Arena, MT-Bench, and 5-shot MMLU are used as benchmarks. This resource provides a general-purpose, though GPT-4-biased, perspective on model capabilities.
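For intuition, the sketch below applies a standard Elo update to a single pairwise preference. The K-factor and 400-point scale are the conventional defaults, not necessarily the exact aggregation Chatbot Arena publishes (its ratings are fit over all battles rather than updated sequentially).

```python
# Minimal sketch of the Elo update behind arena-style pairwise leaderboards.
# K and the 400-point scale are conventional defaults, not Chatbot Arena's
# exact published aggregation.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A's response was preferred, 0.0 if B's, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins the comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```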
MMBench (Text, Vision)
A joint vision and text benchmark evaluating dozens of capabilities, using curated datasets and ChatGPT in the loop.
MME (Text, Vision)
An evaluation benchmark for multimodal large language models, with 14 manually curated subtasks designed to avoid data leakage.
MTEB (Text)
The Massive Text Embedding Benchmark measures the quality of embeddings across 58 datasets and 112 languages for tasks related to retrieval, classification, clustering or semantic similarity.
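A minimal MTEB run over a single task might look like the sketch below, following the package's documented pattern; the model and task names are illustrative choices only.

```python
# Minimal sketch of running one MTEB task on a sentence-embedding model.
# The MTEB(tasks=...) API follows the package's documented usage at the time of
# writing; the model and task names here are illustrative, not prescribed.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```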
OpenASR Leaderboard (Speech)
An automatic leaderboard ranking and evaluating speech recognition models on common benchmarks.
Open LLM Leaderboard (Text)
A popular leaderboard on Hugging Face for ranking open LLMs on their knowledge, reasoning and math capabilities.
SWE-bench (Text)
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
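Task instances can be inspected from the Hugging Face Hub as sketched below. The dataset id princeton-nlp/SWE-bench_Lite and the field names are assumptions about the released data; scoring generated patches additionally requires the execution harness from the SWE-bench repository.

```python
# Minimal sketch: inspecting SWE-bench task instances from the Hugging Face Hub.
# The dataset id "princeton-nlp/SWE-bench_Lite" and the field names below are
# assumptions about the released data; patch scoring is done separately by the
# benchmark's execution harness.
from datasets import load_dataset

swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

task = swe[0]
print(task["repo"])               # GitHub repository the issue comes from
print(task["problem_statement"])  # issue text the model must resolve with a patch
```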
The Edinburgh International Accents of English Corpus (Speech)
A benchmark dataset of diverse English varieties for evaluating automatic speech recognition models, which are typically trained and tested only on US English.
HELM lite (Text)
A lightweight subset of capability-centric benchmarks within HELM with comparisons to many prominent open and closed models.
MMMU (Text, Vision)
A benchmark to evaluate joint text and vision models on 11k examples spanning 30 college-level subject domains.
SIB-200 (Text)
A large-scale open-sourced benchmark dataset for topic classification in 200 languages and dialects.