The ART of LLM Refinement: Ask, Refine, and Trust

14 Nov 2023 · Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz

In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? A popular concept, referred to as self-refinement, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning with refinement objective called ART: Ask, Refine, and Trust, which asks necessary questions to decide when an LLM should refine its output, and either affirm or withhold trust in its refinement by ranking the refinement and the initial prediction. On two multistep reasoning tasks of mathematical word problems (GSM8K) and question answering (StrategyQA), ART achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. We also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.
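The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `ArtComponents` container, its five callables, and the `art` function are hypothetical names introduced here, and how each stage is actually prompted or trained is not specified on this page. The sketch only assumes what the abstract states: questions are asked to decide whether to refine, and the refinement is trusted only if a ranker prefers it over the initial prediction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ArtComponents:
    """Hypothetical interfaces for the three stages; all names are assumptions."""
    generate: Callable[[str], str]                 # problem -> initial prediction
    ask: Callable[[str, str], List[str]]           # (problem, prediction) -> questions
    answers_all: Callable[[str, List[str]], bool]  # does the prediction address every question?
    refine: Callable[[str, str, List[str]], str]   # (problem, prediction, questions) -> refinement
    rank: Callable[[str, str, str], str]           # (problem, initial, refined) -> preferred candidate


def art(problem: str, c: ArtComponents) -> str:
    """Ask, Refine, Trust: return either the initial prediction or its refinement."""
    initial = c.generate(problem)

    # Ask: pose questions about the problem and check the prediction against them.
    questions = c.ask(problem, initial)
    if c.answers_all(initial, questions):
        # No refinement is judged necessary, so keep the initial prediction.
        return initial

    # Refine: produce a revised prediction guided by the questions.
    refined = c.refine(problem, initial, questions)

    # Trust: rank the two candidates and return whichever the ranker prefers.
    return c.rank(problem, initial, refined)
```

In this sketch, `ask`, `answers_all`, and `rank` could be backed by a smaller model than the one behind `generate` and `refine`, which is in line with the abstract's point about using a smaller model as the decision maker.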

Code

No code implementations yet.

Tasks

Arithmetic Reasoning

GSM8K

Question Answering

StrategyQA

Datasets

GSM8K

StrategyQA

Results from the Paper

Ranked #12 on Arithmetic Reasoning on GSM8K

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Arithmetic Reasoning	GSM8K	ChatGPT (Ask, Refine, Trust)	Accuracy	82.6	#51
Arithmetic Reasoning	GSM8K	GPT-4 (Ask, Refine, Trust)	Accuracy	94.08	#12

Methods

No methods listed for this paper.
