open-compass/CriticBench

Introduced by Lan et al. in CriticBench: Evaluating Large Language Models as Critic


[Dataset on HF] [Project Page] [Subjective LeaderBoard] [Objective LeaderBoard]

CriticBench is a benchmark designed to comprehensively and reliably evaluate the critique abilities of Large Language Models (LLMs). These critique abilities are crucial for scalable oversight and self-improvement of LLMs. While many recent studies explore how LLMs can judge and refine flaws in their own outputs, measuring these critique abilities remains under-explored.

Here are the key aspects of CriticBench:

  1. Purpose: To assess LLMs' critique abilities across four dimensions:

    • Feedback: How well an LLM identifies flaws in a given response and provides constructive feedback.
    • Comparison: The ability to compare the quality of different responses.
    • Refinement: How effectively an LLM can refine flawed or suboptimal outputs.
    • Meta-feedback: The ability to judge the quality of feedback itself, i.e., to critique critiques.
  2. Tasks: CriticBench covers nine diverse tasks, evaluating critique abilities on responses that span varying levels of quality.

  3. Evaluation: The benchmark evaluates both open-source and closed-source LLMs, revealing intriguing relationships between critique ability, response quality, and model scale.

  4. Resources: The datasets and an accompanying evaluation toolkit for CriticBench will be publicly released.

In summary, CriticBench aims to provide a comprehensive framework for assessing LLMs' critique and self-improvement capabilities, contributing to the development of more reliable large language models across applications.
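
For quick inspection, the dataset linked above under [Dataset on HF] can be browsed with the Hugging Face `datasets` library. The sketch below is illustrative only: the dataset ID, split name, and field layout are assumptions, so check the linked HF page for the actual values.

```python
# Minimal sketch for loading and inspecting CriticBench from the Hugging Face Hub.
# NOTE: the dataset ID "opencompass/CriticBench" and the "test" split are
# assumptions for illustration; use the ID and splits shown on the linked HF page.
from datasets import load_dataset

dataset = load_dataset("opencompass/CriticBench", split="test")

print(dataset)      # number of examples and column names
print(dataset[0])   # one critique example (task, response(s), reference feedback, etc.)
```

From there, each of the four dimensions (feedback, comparison, refinement, meta-feedback) would be evaluated by prompting a model on the corresponding subset and scoring its outputs, e.g., with the evaluation toolkit once it is released.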
