[Dataset on HF] [Project Page] [Subjective LeaderBoard] [Objective LeaderBoard]
CriticBench is a novel benchmark designed to comprehensively and reliably evaluate the critique abilities of Large Language Models (LLMs). These critique abilities are crucial for the scalable oversight and self-improvement of LLMs. While many recent studies explore how LLMs can identify and refine flaws in their own generated outputs, the measurement of critique abilities remains under-explored.
Here are the key aspects of CriticBench:
- **Purpose**: To assess LLMs' critique abilities across four dimensions.
- **Tasks**: CriticBench encompasses nine diverse tasks, each evaluating LLMs' critique abilities at a different level of quality granularity.
- **Evaluation**: The benchmark evaluates both open-source and closed-source LLMs, revealing intriguing relationships between critique ability, response quality, and model scale.
- **Resources**: The datasets, resources, and an evaluation toolkit for CriticBench will be publicly released.
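Since the dataset is distributed via the Hugging Face Hub (see the "Dataset on HF" link above), it can presumably be loaded with the `datasets` library. The sketch below is only illustrative: the repository ID is a placeholder, as the exact dataset path is not stated here.

```python
from datasets import load_dataset

# Placeholder repository ID -- replace with the actual path from the
# "Dataset on HF" link above.
dataset = load_dataset("your-org/CriticBench")

# Show the available splits and one example record.
print(dataset)
first_split = next(iter(dataset))
print(dataset[first_split][0])
```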
In summary, CriticBench aims to provide a comprehensive framework for assessing LLMs' critique and self-improvement capabilities, contributing to the advancement of large-scale language models in various applications.
| Paper | Code | Results | Date | Stars |
|-------|------|---------|------|-------|