The following tools for evaluating risk serve multiple purposes: identifying issues that need mitigation, tracking the success of any such mitigations, documenting for other users of the model which risks are still present, and informing decisions about model access and release.
Risks & Harms Evaluation
Bias Benchmark for QA (BBQ) [Text]
A dataset of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts.
Crossmodal-3600 [Text, Vision]
Image captioning evaluation with geographically diverse images in 36 languages
From text to talk [Speech]
Harnessing conversational corpora for humane and diversity-aware language technology. The authors show how interactional data from 63 languages (26 language families) harbours insights about turn-taking, timing, sequential structure, and social action.
Hallucinations [Text]
Public LLM leaderboard computed using Vectara’s Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.
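The underlying evaluation model is published on Hugging Face; below is a minimal sketch of scoring a source/summary pair, assuming the model can be loaded as a sentence-transformers CrossEncoder as earlier versions of its model card described (newer releases may require a different loading path).

```python
# Sketch: score a source/summary pair for factual consistency with Vectara's
# hallucination evaluation model. Assumes CrossEncoder-style loading, as in the
# original model card; newer model versions may load differently.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The plant was founded in 1982 and employs about 2,000 people."
summary = "The plant, founded in 1982, employs roughly 2,000 workers."

# predict() returns a consistency score in [0, 1]; lower scores suggest hallucination.
score = model.predict([(source, summary)])[0]
print(f"consistency score: {score:.3f}")
```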
HolisticBias [Text]
A bias and toxicity benchmark using templated sentences, covering nearly 600 descriptor terms across 13 different demographic axes, for a total of 450k examples
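HolisticBias builds its prompts by slotting descriptor terms into simple sentence templates; the toy sketch below illustrates that construction (the descriptors and templates are illustrative, not the benchmark's actual lists).

```python
# Toy illustration of HolisticBias-style prompt construction: descriptor terms
# are slotted into sentence templates to produce a large set of comparable
# prompts. These descriptor and template lists are illustrative only.
descriptors = {
    "ability": ["Deaf", "blind"],
    "age": ["twenty-five-year-old", "eighty-year-old"],
    "nationality": ["Mexican", "Vietnamese"],
}
templates = [
    "I am a {descriptor} person.",
    "My neighbor is a {descriptor} parent.",
]

prompts = [
    (axis, template.format(descriptor=d))
    for axis, terms in descriptors.items()
    for d in terms
    for template in templates
]
for axis, prompt in prompts[:4]:
    print(axis, "->", prompt)
```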
Purple Llama CyberSecEval [Text]
A benchmark for coding assistants, measuring their propensity to generate insecure code and level of compliance when asked to assist in cyberattacks.
Racial disparities in automated speech recognition [Speech]
A discussion of racial disparities and inclusiveness in automated speech recognition.
RealToxicityPrompts [Text]
A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.
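A typical workflow samples prompts, generates continuations with the model under test, and scores those continuations with a toxicity classifier. The sketch below assumes the dataset is available on Hugging Face as allenai/real-toxicity-prompts; generate() and score_toxicity() are placeholders.

```python
# Sketch: evaluate toxic degeneration with RealToxicityPrompts.
# Assumes the Hugging Face dataset id "allenai/real-toxicity-prompts" and the
# field layout from its dataset card; generate() and score_toxicity() are
# placeholders for the model under test and your toxicity scorer
# (e.g. Perspective API or a local classifier).
from datasets import load_dataset

def generate(prompt: str) -> str:          # placeholder: model under test
    raise NotImplementedError

def score_toxicity(text: str) -> float:    # placeholder: toxicity score in [0, 1]
    raise NotImplementedError

dataset = load_dataset("allenai/real-toxicity-prompts", split="train")
sample = dataset.shuffle(seed=0).select(range(100))

scores = []
for record in sample:
    continuation = generate(record["prompt"]["text"])
    scores.append(score_toxicity(continuation))

print("mean continuation toxicity:", sum(scores) / len(scores))
```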
Red Teaming LMs with LMs [Text]
A method for using one language model to automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”)
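A minimal sketch of that loop: a red-team LM proposes test cases, the target LM answers them, and a harm classifier flags failures. All three functions here are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of LM-vs-LM red teaming: generate test cases with a red-team model,
# query the target model, and keep the cases a harm classifier flags.
# red_lm(), target_lm() and is_harmful() are hypothetical placeholders.
def red_lm(instruction: str, n: int) -> list[str]:
    """Ask a red-team LM for n candidate test prompts."""
    raise NotImplementedError

def target_lm(prompt: str) -> str:
    """Get the target LM's response."""
    raise NotImplementedError

def is_harmful(prompt: str, response: str) -> bool:
    """Classifier that flags harmful responses."""
    raise NotImplementedError

def red_team(n_cases: int = 100) -> list[tuple[str, str]]:
    failures = []
    test_cases = red_lm("Write questions likely to elicit offensive replies.", n_cases)
    for prompt in test_cases:
        response = target_lm(prompt)
        if is_harmful(prompt, response):
            failures.append((prompt, response))
    return failures
```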
Safety evaluation repository [Text, Speech, Vision]
A repository of safety evaluations, across all modalities and harms, as of late 2023. Useful for delving deeper if the evaluations listed here don't meet your needs.
SimpleSafetyTests [Text]
Small probe set (100 English text prompts) covering severe harms: child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm
SneakyPrompt [Vision]
Automated jailbreaking method to generate NSFW content even with models that have filters applied
StableBias [Vision]
Bias testing benchmark for text-to-image models, based on gender-occupation associations.
Perspective API [Text]
Perspective API for content moderation. It has three classes of categories, each with 6+ attributes. (1) Production (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, and Threats), (2) Experimental (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, Threat, Sexually Explicit, and Flirtation), (3) NY Times (Attack on author, Attack on commenter, Incoherent, Inflammatory, Likely to Reject, Obscene, Spam, Unsubstantial).
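A minimal sketch of requesting the production attributes from the API's comments:analyze endpoint; you need your own API key, and attribute availability varies by language.

```python
# Sketch: score a comment with the Perspective API (production attributes).
# Requires an API key with the Comment Analyzer API enabled.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a terrible person."},
    "languages": ["en"],
    "requestedAttributes": {
        "TOXICITY": {}, "SEVERE_TOXICITY": {}, "IDENTITY_ATTACK": {},
        "INSULT": {}, "PROFANITY": {}, "THREAT": {},
    },
}

response = requests.post(URL, json=payload, timeout=30).json()
for attribute, result in response["attributeScores"].items():
    print(attribute, round(result["summaryScore"]["value"], 3))
```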
Mistral in-context self-reflection safety prompt [Text]
Self-reflection prompt for use as a content moderation filter. It returns a binary safe/not-safe judgement with 13 subcategories: Illegal, Child abuse, Hate Violence Harassment, Malware, Physical Harm, Economic Harm, Fraud, Adult, Political campaigning or lobbying, Privacy invasion, Unqualified law advice, Unqualified financial advice, Unqualified health advice.
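The idea is to have the model classify a candidate message before it is served. The sketch below shows the pattern with an illustrative prompt that paraphrases the category list above (it is not Mistral's exact published prompt), and chat() stands in for whatever chat-completion client you use.

```python
# Sketch of in-context self-reflection moderation: ask the model to classify a
# message as safe or unsafe with a subcategory. Prompt wording is illustrative,
# not Mistral's exact published prompt; chat() is a placeholder client call.
CATEGORIES = (
    "illegal, child abuse, hate violence harassment, malware, physical harm, "
    "economic harm, fraud, adult, political campaigning or lobbying, privacy "
    "invasion, unqualified law advice, unqualified financial advice, "
    "unqualified health advice"
)

MODERATION_PROMPT = (
    "You are a strict content moderator. Classify the following message as "
    "'safe' or 'unsafe'. If unsafe, also name one subcategory from: "
    f"{CATEGORIES}.\n\nMessage: {{message}}\n\nAnswer:"
)

def chat(prompt: str) -> str:   # placeholder for the chat-completion call
    raise NotImplementedError

def moderate(message: str) -> str:
    return chat(MODERATION_PROMPT.format(message=message)).strip().lower()
```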
Google, Gemini API Safety Filters (via Vertex) [Text]
Safety filter for Gemini models, available through Vertex. 4 safety attributes are described: Hate speech, Harassment, Sexually Explicit, and Dangerous Content. Probabilities are returned for each attribute (Negligible, Low, Medium, High).
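A sketch of configuring these filters and reading the returned ratings with the Vertex AI Python SDK; the class and enum names below reflect my recollection of vertexai.generative_models and may differ across SDK versions.

```python
# Sketch: set Gemini safety thresholds and inspect returned safety ratings via
# the Vertex AI SDK. Class/enum names follow my recollection of
# vertexai.generative_models and may differ between SDK versions.
import vertexai
from vertexai.generative_models import GenerativeModel, HarmCategory, HarmBlockThreshold

vertexai.init(project="my-project", location="us-central1")  # placeholder project

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content(
    "Summarize the plot of Macbeth.",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)

# Each rating carries the attribute and a probability bucket
# (NEGLIGIBLE / LOW / MEDIUM / HIGH).
for rating in response.candidates[0].safety_ratings:
    print(rating.category, rating.probability)
```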
Google, PaLM API Safety Filters (via Vertex) [Text]
Safety filter for PaLM models, available through Vertex. 16 safety attributes are described (some of which are 'topical' rather than purely safety risks): Derogatory, Toxic, Violent, Sexual, Insult, Profanity, Death Harm & Tragedy, Firearms & Weapons, Public safety, Health, Religion & belief, Illicit drugs, War & conflict, Politics, Finance, Legal.
Anthropic content moderation prompt [Text, Speech, Vision]
In-context prompt for assessing whether messages and responses contain inappropriate content: “violent, illegal or pornographic activities”
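A sketch of the same pattern with the Anthropic Messages API: wrap the user content in a yes/no moderation question and parse the single-token answer. The prompt wording paraphrases the check described above; it is not Anthropic's exact published prompt, and the model name is illustrative.

```python
# Sketch: in-context content moderation with the Anthropic Messages API.
# The prompt paraphrases the "violent, illegal or pornographic activities"
# check described above; model name and wording are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def is_inappropriate(user_message: str) -> bool:
    prompt = (
        "A human wants to submit the following message:\n\n"
        f"{user_message}\n\n"
        "Does it refer to violent, illegal or pornographic activities? "
        "Answer with (Y) or (N) only."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return "(Y)" in response.content[0].text
```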
Cohere in-context content moderation prompt [Text]
Few-shot prompt for classifying whether text is toxic or not.
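The same pattern works as a plain few-shot prompt against any completion or chat endpoint; a sketch with illustrative labelled examples, where complete() is a placeholder for your client.

```python
# Sketch of a few-shot toxicity classification prompt in the style of the
# Cohere example. The labelled examples are illustrative, and complete() is a
# placeholder for whatever completion/chat client you use.
FEW_SHOT_PROMPT = """Classify each message as Toxic or Benign.

Message: Have a great weekend!
Label: Benign

Message: You people are all worthless idiots.
Label: Toxic

Message: {message}
Label:"""

def complete(prompt: str) -> str:   # placeholder completion call
    raise NotImplementedError

def classify_toxicity(message: str) -> str:
    return complete(FEW_SHOT_PROMPT.format(message=message)).strip()
```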
Model Risk Cards [Text, Speech, Vision]
A framework for structured assessment and documentation of risks associated with an application of language models. Each RiskCard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. The paper also describes 70+ risks identified from a literature survey.
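One lightweight way to keep such documentation machine-readable is a small structured record per risk; the sketch below shows what such a record might contain, with field names inferred from the description above rather than taken from the paper's official template.

```python
# Sketch: a machine-readable RiskCard-style record. Field names are inferred
# from the description above, not copied from the paper's official template.
from dataclasses import dataclass, field

@dataclass
class RiskCard:
    name: str
    description: str
    harm_types: list[str]                 # placement in harm taxonomies
    harm_routes: list[str]                # how the risk manifests as harm
    example_prompts: list[tuple[str, str]] = field(default_factory=list)  # (prompt, output)

card = RiskCard(
    name="Unqualified medical advice",
    description="Model offers diagnosis or treatment advice without qualification.",
    harm_types=["physical harm", "misinformation"],
    harm_routes=["user follows unsafe dosage instructions"],
    example_prompts=[("How much ibuprofen is safe for my toddler?", "<harmful output>")],
)
```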