NPHardEval is a dynamic benchmark designed to assess the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of algorithmic questions. Let's delve into the details:

  1. Benchmark Purpose:
     • Complex Reasoning Ability: Complex reasoning is one of the most important capabilities of current LLMs, and it plays an integral role in complex decision-making tasks.
     • Inadequacy of Existing Benchmarks: Existing benchmarks for LLM reasoning fall short of rigorously assessing the full extent of these abilities. Because they are public and static, they also risk overfitting, as models can be tuned to their specific questions and metrics.
     • Introducing NPHardEval: NPHardEval was introduced to address these limitations. It rigorously evaluates LLMs' reasoning abilities on tasks extending up to the NP-Hard complexity class.

  2. Key Features of NPHardEval:
     • 900 Algorithmic Questions: NPHardEval includes a diverse set of 900 algorithmic questions, carefully chosen to span a wide range of complexity classes up to NP-Hard. These questions serve as a rigorous measure of LLMs' reasoning abilities (an illustrative example of such a task is sketched below).
     • Dynamic Update Mechanism: Unlike static benchmarks, NPHardEval updates its datapoints monthly. Regular regeneration mitigates the risk of overfitting, ensuring a more accurate and reliable assessment of LLMs' reasoning capabilities (see the generation sketch after the task example).
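
For a concrete sense of what such a question looks like, the sketch below works through a tiny instance of the Traveling Salesman Problem, one of the NP-hard tasks represented in the benchmark. It is a minimal illustration only: the distance matrix and the brute-force check are hypothetical and are not taken from the benchmark's actual datapoints or code.

```python
# Illustrative sketch: a tiny Traveling Salesman Problem instance of the kind an
# NP-hard NPHardEval question might ask a model to reason about. The 4-city
# distance matrix below is hypothetical, not an actual benchmark datapoint.
from itertools import permutations

DIST = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def tour_length(tour: tuple) -> int:
    """Total length of a closed tour that starts and ends at city 0."""
    path = (0,) + tour + (0,)
    return sum(DIST[a][b] for a, b in zip(path, path[1:]))

def brute_force_tsp():
    """Enumerate every ordering of the remaining cities -- feasible only for
    tiny instances, which is exactly why TSP sits in the NP-hard tier."""
    best = min(permutations(range(1, len(DIST))), key=tour_length)
    return best, tour_length(best)

if __name__ == "__main__":
    tour, length = brute_force_tsp()
    print(f"Optimal tour: 0 -> {' -> '.join(map(str, tour))} -> 0, length {length}")
```

An LLM answering a question of this kind cannot fall back on exhaustive search and must instead reason about the structure of the instance, which is what makes the NP-hard tier of the benchmark demanding.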

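The monthly refresh can be pictured as regenerating every datapoint from fresh random instances. The sketch below is a hypothetical illustration of that idea, assuming a seed derived from the current month; it is not the benchmark's actual generation pipeline.

```python
# Illustrative sketch: regenerating random problem instances (here, symmetric TSP
# distance matrices) once per month, so each refresh cycle produces a new but
# reproducible datapoint set. Hypothetical code, not NPHardEval's own generator.
import random
from datetime import date

def monthly_seed(today=None) -> int:
    """Derive a deterministic seed from the current year and month."""
    today = today or date.today()
    return today.year * 100 + today.month

def generate_tsp_instance(num_cities: int, rng: random.Random):
    """Create one symmetric random distance matrix."""
    dist = [[0] * num_cities for _ in range(num_cities)]
    for i in range(num_cities):
        for j in range(i + 1, num_cities):
            dist[i][j] = dist[j][i] = rng.randint(1, 100)
    return dist

def generate_batch(num_instances: int, num_cities: int):
    """Produce this month's batch of fresh instances."""
    rng = random.Random(monthly_seed())
    return [generate_tsp_instance(num_cities, rng) for _ in range(num_instances)]

if __name__ == "__main__":
    batch = generate_batch(num_instances=3, num_cities=5)
    print(f"Generated {len(batch)} fresh instances for this refresh cycle.")
```

Because every cycle replaces the public datapoints, a model cannot simply memorize the previous month's questions, which is the overfitting risk the dynamic design is meant to counter.
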
  3. Research Contribution:
     • Objective Perspective: By comparing model performance across complexity classes, NPHardEval sheds light on the current state of reasoning in LLMs.
     • Available Resources: The benchmark dataset and code for NPHardEval are publicly available ¹.

In summary, NPHardEval provides a comprehensive evaluation framework for assessing LLMs' reasoning abilities through the lens of computational complexity classes. 🌟

(1) NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. https://arxiv.org/abs/2312.14890
(2) NPHardEval/README.md at main · casmlab/NPHardEval · GitHub. https://github.com/casmlab/NPHardEval/blob/main/README.md
(3) NPHardEval: Benchmarking Reasoning Ability of Large Language Models via Complexity Classes. https://frankling2020.github.io/publication/nphardeval/
(4) DOI: https://doi.org/10.48550/arXiv.2312.14890
