24 Risk & Harms Evaluation Resources for Foundation Models

The following tools for evaluating risk serve multiple purposes: to identify if there are issues which need mitigation, to track the success of any such mitigations, to document for other users of the model what risks are still present, and to help make decisions related to model access and release.

  • Bias Benchmark for QA (BBQ)

    A dataset of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts.

  • Crossmodal-3600

    Image captioning evaluation with geographically diverse images in 36 languages.

  • FactualityPrompt

    A benchmark to measure factuality in language models.

  • From text to talk

    Harnessing conversational corpora for humane and diversity-aware language technology. The authors show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure, and social action.

  • Hallucinations

    Public LLM leaderboard computed using Vectara’s Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.

  • HolisticBias

    A bias and toxicity benchmark using templated sentences, covering nearly 600 descriptor terms across 13 different demographic axes, for a total of 450k examples.
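    The templating approach HolisticBias takes can be sketched as a cross-product of sentence templates, per-axis descriptor terms, and noun phrases. The templates, axes, and terms below are illustrative stand-ins, not the benchmark's actual lists (which cover ~600 descriptors across 13 axes):

    ```python
    # Sketch of HolisticBias-style template expansion. All strings below are
    # illustrative examples, not the dataset's real templates or descriptors.
    TEMPLATES = [
        "I am {noun_phrase}.",
        "I have a friend who is {noun_phrase}.",
    ]
    DESCRIPTORS = {"age": ["young", "elderly"], "ability": ["deaf", "blind"]}
    NOUNS = ["person", "parent"]

    def expand(templates, descriptors, nouns):
        """Yield (axis, descriptor, sentence) for every combination."""
        for axis, terms in descriptors.items():
            for term in terms:
                for noun in nouns:
                    noun_phrase = f"a {term} {noun}"  # e.g. "a young person"
                    for template in templates:
                        yield axis, term, template.format(noun_phrase=noun_phrase)

    sentences = list(expand(TEMPLATES, DESCRIPTORS, NOUNS))
    ```

    Each generated sentence can then be scored for toxicity or compared across descriptor swaps to surface differential model behavior.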

  • Purple Llama CyberSecEval

    A benchmark for coding assistants, measuring their propensity to generate insecure code and level of compliance when asked to assist in cyberattacks.

  • Purple Llama Guard

    A tool to identify and protect against malicious inputs to LLMs.

  • Racial disparities in automated speech recognition

    A discussion of racial disparities and inclusiveness in automated speech recognition.

  • RealToxicityPrompts

    A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.

  • Red Teaming LMs with LMs

    A method that uses one language model to automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”).
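    The loop this method describes can be sketched with stub functions; the three model functions below are placeholders, not the paper's actual red-team model, target model, or harm classifier:

    ```python
    # Minimal sketch of the red-teaming loop: a red-team LM proposes test
    # prompts, the target LM answers, and a classifier flags harmful replies.
    # All three functions are illustrative stand-ins.
    def red_team_generate(n):
        # Stand-in for sampling n test prompts from a red-team LM.
        return [f"test prompt {i}" for i in range(n)]

    def target_model(prompt):
        # Stand-in for the target LM under evaluation.
        return f"response to {prompt}"

    def harm_classifier(response):
        # Stand-in harmfulness classifier; flags a fixed keyword here.
        return "prompt 3" in response

    def red_team(n_cases):
        """Return (prompt, reply) pairs where the target's reply was flagged."""
        failures = []
        for prompt in red_team_generate(n_cases):
            reply = target_model(prompt)
            if harm_classifier(reply):
                failures.append((prompt, reply))
        return failures
    ```

    In practice each stand-in would be a real model call, and the flagged pairs become a test suite for mitigation work.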

  • Safety evaluation repository

    A repository of safety evaluations, across all modalities and harms, as of late 2023. Useful for delving deeper if the evaluations listed here don’t meet your needs.

  • SimpleSafetyTests

    A small probe set (100 English text prompts) covering severe harms: child abuse; suicide, self-harm, and eating disorders; scams and fraud; illegal items; and physical harm.

  • SneakyPrompt

    An automated jailbreaking method for generating NSFW content even from models that have filters applied.

  • StableBias

    Bias-testing benchmark for text-to-image models, based on gender-occupation associations.

  • Perspective API

    Perspective API for content moderation. It has three classes of categories, each with 6+ attributes: (1) Production (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, and Threat); (2) Experimental (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, Threat, Sexually Explicit, and Flirtation); (3) NY Times (Attack on author, Attack on commenter, Incoherent, Inflammatory, Likely to Reject, Obscene, Spam, Unsubstantial).
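    A request for the six production attributes can be sketched as follows; only construction of the JSON body is shown, since actually sending it requires an API key and a POST to the `comments:analyze` endpoint:

    ```python
    # Sketch of a Perspective API AnalyzeComment request body covering the six
    # production attributes. Attribute names follow the API's conventions;
    # the helper function itself is illustrative, not part of any client library.
    PRODUCTION_ATTRIBUTES = [
        "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
        "INSULT", "PROFANITY", "THREAT",
    ]

    def build_analyze_request(text, attributes=PRODUCTION_ATTRIBUTES, lang="en"):
        """Return a JSON-serializable body for a comments:analyze call."""
        return {
            "comment": {"text": text},
            "languages": [lang],
            "requestedAttributes": {name: {} for name in attributes},
        }
    ```

    The response returns a summary score between 0 and 1 for each requested attribute, which callers typically compare against their own moderation threshold.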

  • Mistral in-context self-reflection safety prompt

    Self-reflection prompt for use as a content moderation filter. It returns a binary value (safe/unsafe) with 13 subcategories: Illegal, Child abuse, Hate Violence Harassment, Malware, Physical Harm, Economic Harm, Fraud, Adult, Political campaigning or lobbying, Privacy invasion, Unqualified law advice, Unqualified financial advice, and Unqualified health advice.
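    A self-reflection filter of this kind can be sketched as a prompt builder over the 13 subcategories; the wording below is a paraphrase for illustration, not Mistral's published prompt text:

    ```python
    # Illustrative self-reflection moderation prompt in the style described
    # above. The prompt wording is a paraphrase, not the original.
    CATEGORIES = [
        "Illegal", "Child abuse", "Hate Violence Harassment", "Malware",
        "Physical Harm", "Economic Harm", "Fraud", "Adult",
        "Political campaigning or lobbying", "Privacy invasion",
        "Unqualified law advice", "Unqualified financial advice",
        "Unqualified health advice",
    ]

    def build_moderation_prompt(text):
        """Build a prompt asking the model to self-classify a message."""
        cats = "; ".join(CATEGORIES)
        return (
            "You are given a message. Decide whether it is safe or unsafe.\n"
            f"If unsafe, name the matching category from: {cats}.\n"
            f"Message: {text}\n"
            "Answer (safe/unsafe, category):"
        )
    ```

    The built prompt is sent to the model itself, and the returned safe/unsafe verdict drives the moderation decision.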

  • Google, Gemini API Safety Filters (via Vertex)

    Safety filter for Gemini models, available through Vertex. 4 safety attributes are described: Hate speech, Harassment, Sexually Explicit, and Dangerous Content. Probabilities are returned for each attribute (Negligible, Low, Medium, High).
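    Because each attribute comes back as a probability bucket rather than a raw score, a caller typically blocks a response once any bucket reaches a chosen threshold. The rating dict shape below is a simplified illustration, not the exact Vertex response format:

    ```python
    # Sketch of thresholding the four bucketed safety ratings described above.
    # The ratings dict is an illustrative shape, not the real API response.
    LEVELS = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

    def is_blocked(ratings, threshold="MEDIUM"):
        """Return True if any attribute's probability meets the threshold."""
        cutoff = LEVELS.index(threshold)
        return any(LEVELS.index(level) >= cutoff for level in ratings.values())

    ratings = {
        "HATE_SPEECH": "NEGLIGIBLE",
        "HARASSMENT": "LOW",
        "SEXUALLY_EXPLICIT": "NEGLIGIBLE",
        "DANGEROUS_CONTENT": "MEDIUM",
    }
    ```

    Raising the threshold to "HIGH" makes the filter more permissive; lowering it to "LOW" blocks more aggressively.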

  • Google, PaLM API Safety Filters (via Vertex)

    Safety filter for PaLM models, available through Vertex. 16 safety attributes are described (some of which are “topical” rather than purely safety risks): Derogatory, Toxic, Violent, Sexual, Insult, Profanity, Death, Harm & Tragedy, Firearms & Weapons, Public Safety, Health, Religion & Belief, Illicit Drugs, War & Conflict, Politics, Finance, Legal.

  • Anthropic content moderation prompt

    In-context prompt for assessing whether messages and responses contain inappropriate content: “violent, illegal or pornographic activities”.

  • Cohere in-context content moderation prompt

    Few-shot prompt for classifying whether text is toxic or not.

  • NVidia NeMo Guardrails

    Open-source tooling to create guardrails for LLM applications.

  • SafetyPrompts

    Open repository of datasets for LLM safety.

  • Model Risk Cards

    A framework for structured assessment and documentation of risks associated with an application of language models. Each RiskCard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. The paper also describes 70+ risks identified from a literature survey.
