open-compass/CriticBench

Introduced by Lan et al. in CriticBench: Evaluating Large Language Models as Critic


[Dataset on HF] [Project Page] [Subjective LeaderBoard] [Objective LeaderBoard]

CriticBench is a benchmark designed to comprehensively and reliably evaluate the critique abilities of Large Language Models (LLMs). These critique abilities are crucial for scalable oversight and self-improvement of LLMs. While many recent studies explore how LLMs can judge and refine flaws in their own outputs, measuring these critique abilities remains under-explored.

Here are the key aspects of CriticBench:

  1. Purpose: To assess LLMs' critique abilities across four dimensions:

    • Feedback: How well an LLM identifies flaws in a given response and provides constructive feedback.
    • Comparison: The ability to compare the quality of different responses.
    • Refinement: How effectively an LLM can refine flawed or suboptimal outputs.
    • Meta-feedback: The ability to judge the quality of feedback itself, i.e., to critique critiques.
  2. Tasks: CriticBench covers nine diverse tasks, evaluating critique abilities on responses that span varying levels of quality.

  3. Evaluation: The benchmark evaluates both open-source and closed-source LLMs, revealing intriguing relationships between critique ability, response quality, and model scale.

  4. Resources: The datasets and an accompanying evaluation toolkit for CriticBench will be publicly released.

In summary, CriticBench aims to provide a comprehensive framework for assessing LLMs' critique and self-improvement capabilities, contributing to the development of more reliable large language models across applications.
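
For quick inspection, the dataset linked above under [Dataset on HF] can be browsed with the Hugging Face `datasets` library. The sketch below is illustrative only: the dataset ID, split name, and field layout are assumptions, so check the linked HF page for the actual values.

```python
# Minimal sketch for loading and inspecting CriticBench from the Hugging Face Hub.
# NOTE: the dataset ID "opencompass/CriticBench" and the "test" split are
# assumptions for illustration; use the ID and splits shown on the linked HF page.
from datasets import load_dataset

dataset = load_dataset("opencompass/CriticBench", split="test")

print(dataset)      # number of examples and column names
print(dataset[0])   # one critique example (task, response(s), reference feedback, etc.)
```

From there, each of the four dimensions (feedback, comparison, refinement, meta-feedback) would be evaluated by prompting a model on the corresponding subset and scoring its outputs, e.g., with the evaluation toolkit once it is released.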
