CriticBench is a comprehensive benchmark designed to assess the ability of Large Language Models (LLMs) to critique and correct their reasoning across a range of tasks. It spans five reasoning domains:

  1. Mathematical
  2. Commonsense
  3. Symbolic
  4. Coding
  5. Algorithmic

CriticBench compiles 15 datasets and incorporates responses from three LLM families. Using CriticBench, the authors evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, referred to as GQC reasoning (a minimal sketch of this loop appears after the findings below). Notable findings include:

  1. A linear relationship in GQC capabilities, with critique-focused training significantly enhancing performance.
  2. Task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction.
  3. GQC knowledge inconsistencies that decrease as model size increases.
  4. An intriguing inter-model critiquing dynamic, where stronger models excel at critiquing weaker ones, while weaker models surprisingly surpass stronger ones in self-critique.
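To make the GQC setup concrete, the sketch below shows one way the three stages could be chained for a single question. It is an illustrative outline only, not the official CriticBench harness: query_model, the prompt wording, and the example question are hypothetical placeholders.

    # A minimal sketch of a generation-critique-correction (GQC) loop of the kind
    # CriticBench measures. query_model and the example item are hypothetical
    # stand-ins, not part of the official benchmark code.

    from dataclasses import dataclass


    @dataclass
    class GQCResult:
        generation: str   # the model's initial answer
        critique: str     # the model's judgement of that answer
        correction: str   # the model's revised answer after critique


    def query_model(prompt: str) -> str:
        """Placeholder for an LLM call (e.g., an API request); returns a canned string here."""
        return f"[model output for: {prompt[:40]}...]"


    def run_gqc(question: str) -> GQCResult:
        # 1. Generation: ask the model to solve the task.
        generation = query_model(f"Solve the following problem:\n{question}")

        # 2. Critique: ask the model to judge whether the answer is correct.
        critique = query_model(
            f"Problem:\n{question}\n\nProposed answer:\n{generation}\n\n"
            "Is this answer correct? Explain any errors."
        )

        # 3. Correction: ask the model to revise the answer given its critique.
        correction = query_model(
            f"Problem:\n{question}\n\nProposed answer:\n{generation}\n\n"
            f"Critique:\n{critique}\n\nGive a corrected final answer."
        )

        return GQCResult(generation, critique, correction)


    if __name__ == "__main__":
        # Hypothetical math-style item; CriticBench draws such questions from 15 datasets.
        result = run_gqc("If a train travels 60 km in 1.5 hours, what is its average speed?")
        print(result)

A real evaluation would replace query_model with calls to the model under test and score each stage against the reference answers of the underlying datasets.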

(1) CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. https://arxiv.org/abs/2402.14809. (2) CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. https://openreview.net/forum?id=sc5i7q6DQO.

License


  • MIT

Modalities


Languages