TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Arithmetic Reasoning	GSM8K	Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting)	Accuracy	61	# 104
Arithmetic Reasoning	GSM8K	Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting)	Parameters (Billion)	70	# 86
Arithmetic Reasoning	GSM8K	Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting)	Accuracy	43	# 125
Arithmetic Reasoning	GSM8K	Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting)	Parameters (Billion)	13	# 53
Arithmetic Reasoning	GSM8K	Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting)	Accuracy	41	# 127
Arithmetic Reasoning	GSM8K	Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting)	Parameters (Billion)	7	# 10
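
Because each accuracy above is measured on only 100 questions, the gaps between models carry noticeable sampling uncertainty. The following is a minimal sketch of a 95% Wilson score interval for such a score. It assumes that an accuracy of 61 means 61 of 100 questions answered correctly and that questions can be treated as independent trials; the function name `accuracy_interval` is hypothetical and not taken from the paper or its code.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width

# Assumed counts: accuracy values from the table read as "correct out of 100".
scores = {"Llama-2 70B": 61, "Llama-2 13B": 43, "Mistral 7B": 41}

for model, correct in scores.items():
    low, high = accuracy_interval(correct, 100)
    print(f"{model}: {correct}/100, interval {low:.3f} to {high:.3f}")
```

Under these assumptions the intervals for Llama-2 13B and Mistral 7B overlap heavily, so a two-point difference on 100 questions should not be read as a reliable ordering.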


The Unreasonable Effectiveness of Eccentric Automatic Prompts

9 Feb 2024 · Rick Battle, Teja Gollapudi

Large Language Models (LLMs) have demonstrated remarkable problem-solving and basic mathematics abilities. However, their efficacy is highly contingent on the formulation of the prompt. This study endeavors to quantify the influence of incorporating "positive thinking" into the system message of the prompt, then compare that to systematic prompt optimization. We assess the performance of 60 combinations of system message snippets, tested with and without Chain of Thought prompting, across three models with parameters ranging from 7 to 70 billion on the GSM8K dataset. Our findings reveal that results do not universally generalize across models. In most instances, the inclusion of "positive thinking" prompts positively affected model performance. Notably, however, Llama2-70B exhibited an exception when not utilizing Chain of Thought, as the optimal system message was found to be none at all. Given the combinatorial complexity, and thus computation time, of experimenting with hand-tuning prompts for large black-box models, we then compared the performance of the best "positive thinking" prompt against the output of systematic prompt optimization. We show that employing an automated prompt optimizer emerges as the most effective method for enhancing performance, even when working with smaller open-source models. Additionally, our findings reveal that the highest-scoring, automatically-optimized prompt exhibits a degree of peculiarity far beyond expectations.
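
The experimental design described in the abstract (60 system-message combinations, each tested with and without Chain of Thought, on three models) can be sketched as a simple grid enumeration. This is only an illustration: the abstract gives the total of 60 but not how it is composed, so the split into 5 openers, 3 task descriptions and 4 closers is an assumption, the snippet texts are placeholders, and the function names are hypothetical.

```python
from itertools import product

# Placeholder snippet pools; None means "omit this part".
# The pool sizes (5 x 3 x 4) are an assumption chosen so the product is 60.
OPENERS = [None, "opener A", "opener B", "opener C", "opener D"]
TASK_DESCRIPTIONS = [None, "task description A", "task description B"]
CLOSERS = [None, "closer A", "closer B", "closer C"]

# Model names as they appear in the results table.
MODELS = ["Mistral 7B", "Llama-2 13B", "Llama-2 70B"]

def build_system_message(parts):
    """Join the non-empty snippets; the all-None combination gives an empty message."""
    return " ".join(p for p in parts if p)

def experiment_grid():
    """Yield one configuration per (model, Chain of Thought setting, system message)."""
    messages = product(OPENERS, TASK_DESCRIPTIONS, CLOSERS)
    for model, use_cot, parts in product(MODELS, (False, True), messages):
        yield {
            "model": model,
            "chain_of_thought": use_cot,
            "system_message": build_system_message(parts),
        }

grid = list(experiment_grid())
print(len(grid))  # 3 models x 2 settings x 60 messages = 360
```

The empty system message is part of the grid, which matches the abstract's observation that for Llama2-70B without Chain of Thought the best system message was none at all. The size of the grid also illustrates why hand-tuning becomes expensive and why the authors turn to an automated prompt optimizer.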


Code


stanfordnlp/dspy official

10,597

Tasks


Arithmetic Reasoning

GSM8K

Datasets

GSM8K

Results from the Paper


Ranked #104 on Arithmetic Reasoning on GSM8K



