NPHardEval is a dynamic benchmark designed to assess the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of algorithmic questions. Let's delve into the details:

  1. Benchmark Purpose:
     • Complex Reasoning Ability: Complex reasoning is one of the most important capabilities of current LLMs, and it plays an integral role in complex decision-making tasks.
     • Inadequacy of Existing Benchmarks: Existing benchmarks for LLM reasoning fall short of rigorously assessing the full extent of these abilities. Because they are public and static, they also risk overfitting, as models can be tuned to their specific questions and metrics.
     • Introducing NPHardEval: NPHardEval was introduced to address these limitations. It rigorously evaluates LLMs' reasoning abilities on tasks extending up to the NP-Hard complexity class.

  2. Key Features of NPHardEval:
     • 900 Algorithmic Questions: NPHardEval includes a diverse set of 900 algorithmic questions, carefully chosen to span a wide range of complexity classes up to NP-Hard. These questions serve as a rigorous measure of LLMs' reasoning abilities (an illustrative example of such a task is sketched below).
     • Dynamic Update Mechanism: Unlike static benchmarks, NPHardEval updates its datapoints monthly. Regular regeneration mitigates the risk of overfitting, ensuring a more accurate and reliable assessment of LLMs' reasoning capabilities (see the generation sketch after the task example).
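
For a concrete sense of what such a question looks like, the sketch below works through a tiny instance of the Traveling Salesman Problem, one of the NP-hard tasks represented in the benchmark. It is a minimal illustration only: the distance matrix and the brute-force check are hypothetical and are not taken from the benchmark's actual datapoints or code.

```python
# Illustrative sketch: a tiny Traveling Salesman Problem instance of the kind an
# NP-hard NPHardEval question might ask a model to reason about. The 4-city
# distance matrix below is hypothetical, not an actual benchmark datapoint.
from itertools import permutations

DIST = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def tour_length(tour: tuple) -> int:
    """Total length of a closed tour that starts and ends at city 0."""
    path = (0,) + tour + (0,)
    return sum(DIST[a][b] for a, b in zip(path, path[1:]))

def brute_force_tsp():
    """Enumerate every ordering of the remaining cities -- feasible only for
    tiny instances, which is exactly why TSP sits in the NP-hard tier."""
    best = min(permutations(range(1, len(DIST))), key=tour_length)
    return best, tour_length(best)

if __name__ == "__main__":
    tour, length = brute_force_tsp()
    print(f"Optimal tour: 0 -> {' -> '.join(map(str, tour))} -> 0, length {length}")
```

An LLM answering a question of this kind cannot fall back on exhaustive search and must instead reason about the structure of the instance, which is what makes the NP-hard tier of the benchmark demanding.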

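The monthly refresh can be pictured as regenerating every datapoint from fresh random instances. The sketch below is a hypothetical illustration of that idea, assuming a seed derived from the current month; it is not the benchmark's actual generation pipeline.

```python
# Illustrative sketch: regenerating random problem instances (here, symmetric TSP
# distance matrices) once per month, so each refresh cycle produces a new but
# reproducible datapoint set. Hypothetical code, not NPHardEval's own generator.
import random
from datetime import date

def monthly_seed(today=None) -> int:
    """Derive a deterministic seed from the current year and month."""
    today = today or date.today()
    return today.year * 100 + today.month

def generate_tsp_instance(num_cities: int, rng: random.Random):
    """Create one symmetric random distance matrix."""
    dist = [[0] * num_cities for _ in range(num_cities)]
    for i in range(num_cities):
        for j in range(i + 1, num_cities):
            dist[i][j] = dist[j][i] = rng.randint(1, 100)
    return dist

def generate_batch(num_instances: int, num_cities: int):
    """Produce this month's batch of fresh instances."""
    rng = random.Random(monthly_seed())
    return [generate_tsp_instance(num_cities, rng) for _ in range(num_instances)]

if __name__ == "__main__":
    batch = generate_batch(num_instances=3, num_cities=5)
    print(f"Generated {len(batch)} fresh instances for this refresh cycle.")
```

Because every cycle replaces the public datapoints, a model cannot simply memorize the previous month's questions, which is the overfitting risk the dynamic design is meant to counter.
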
  3. Research Contribution:
     • Objective Perspective: By comparing model performance across complexity classes, NPHardEval sheds light on the current state of reasoning in LLMs.
     • Available Resources: The benchmark dataset and code for NPHardEval are publicly available ¹.

In summary, NPHardEval provides a comprehensive evaluation framework for assessing LLMs' reasoning abilities through the lens of computational complexity classes. 🌟

(1) NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. https://arxiv.org/abs/2312.14890
(2) NPHardEval/README.md at main · casmlab/NPHardEval · GitHub. https://github.com/casmlab/NPHardEval/blob/main/README.md
(3) NPHardEval: Benchmarking Reasoning Ability of Large Language Models via Complexity Classes. https://frankling2020.github.io/publication/nphardeval/
(4) DOI: https://doi.org/10.48550/arXiv.2312.14890
