TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Overall - Test	JEEBench	GPT-4+CoT+Self-Consistency@8	Score	0.389	# 1
Overall - Test	JEEBench	GPT-4+CoT+Self-Critique	Score	0.339	# 3
Overall - Test	JEEBench	GPT-4+(1-shot CoT)	Score	0.292	# 5
Overall - Test	JEEBench	GPT-4+CoT	Score	0.350	# 2
Overall - Test	JEEBench	GPT-4	Score	0.309	# 4
Overall - Test	JEEBench	GPT-3.5	Score	0.177	# 7
Overall - Test	JEEBench	PaLM2	Score	0.153	# 8
Overall - Test	JEEBench	GPT-3	Score	0.122	# 9
Overall - Test	JEEBench	Falcon7B-Instruct	Score	0.098	# 11
Overall - Test	JEEBench	Alpaca-LoRA	Score	0.089	# 12
Overall - Test	JEEBench	Random	Score	0.105	# 10

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

24 May 2023 · Daman Arora, Himanshu Gaurav Singh, Mausam

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations accurately and failure in retrieving relevant domain-specific concepts. We also observe that by mere prompting, GPT-4 is unable to assess risk introduced by negative marking for incorrect answers. For this, we develop a post-hoc confidence-thresholding method over self-consistency, which enables effective response selection. We hope that our challenging benchmark will guide future research in problem-solving using LLMs.
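
The abstract mentions confidence thresholding over self-consistency as a way to handle negative marking. A minimal sketch of that general idea follows; it is not the authors' implementation. The function names, the threshold values, and the +3/-1 marking scheme are hypothetical choices made for illustration only. Confidence is taken here to be the vote share of the most frequent answer among the sampled responses, and the model abstains when that share falls below the threshold.

```python
from collections import Counter
from typing import Optional, Sequence


def select_answer(samples: Sequence[str], threshold: float) -> Optional[str]:
    """Return the majority answer if its vote share reaches `threshold`.

    Returns None (abstain) when the samples are empty or the most
    frequent answer is not confident enough.
    """
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)  # vote share of the top answer
    return answer if confidence >= threshold else None


def marked_score(prediction: Optional[str], gold: str,
                 reward: float = 3.0, penalty: float = 1.0) -> float:
    """Score one question under negative marking.

    The reward and penalty defaults are illustrative assumptions,
    not values taken from the paper or the exam.
    """
    if prediction is None:
        return 0.0  # abstaining costs nothing
    return reward if prediction == gold else -penalty


# Eight sampled answers to one question (hypothetical data).
samples = ["B", "B", "A", "B", "C", "B", "D", "B"]
lenient = select_answer(samples, threshold=0.5)   # "B" has 5/8 of the votes
strict = select_answer(samples, threshold=0.75)   # 5/8 < 0.75, so abstain
```

With a lenient threshold the majority answer is submitted and risks the penalty; with a strict one the question is skipped and scores zero. In practice the threshold would be tuned on held-out questions, which this sketch does not attempt.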

Code

hgaurav2k/jeebench official

Tasks

Overall - Test

Datasets

Introduced in the Paper:

JEEBench

Used in the Paper:

MMLU

GSM8K

MATH

ScienceQA

SciQ

AQUA-RAT

SciBench

Results from the Paper

Ranked #1 on Overall - Test on JEEBench

Methods

Absolute Position Encodings • Adam • Attention Dropout • BPE • Cosine Annealing • Dense Connections • Discriminative Fine-Tuning • Dropout • GELU • GPT • GPT-4 • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Weight Decay
