Finetuned Language Models Are Zero-Shot Learners

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially outperforms its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
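To make the method concrete, here is a minimal sketch of how a labeled NLI example can be verbalized with instruction templates. The template wording and helper names below are illustrative assumptions, not the paper's exact templates (the paper reports composing ten unique hand-written templates per dataset).

```python
# Minimal sketch of instruction-template verbalization for one NLI example.
# The two templates here are illustrative assumptions; FLAN uses multiple
# hand-written templates per dataset to increase instruction diversity.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: {options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? OPTIONS: {options}",
]

def verbalize(example: dict, template: str) -> str:
    """Render one labeled example as a natural-language instruction."""
    return template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=", ".join(example["options"]),
    )

example = {
    "premise": "A dog is running through the snow.",
    "hypothesis": "An animal is outside.",
    "options": ["yes", "it is not possible to tell", "no"],
}

for t in NLI_TEMPLATES:
    print(verbalize(example, t))
    print("---")
```

Each rendered string, paired with the gold label as the target text, becomes one finetuning example; mixing many datasets verbalized this way is what the paper calls instruction tuning.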

Published at ICLR 2022.

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | FLAN 137B (zero-shot) | Accuracy | 63.1 | #19 |
| Common Sense Reasoning | ARC (Challenge) | FLAN 137B (few-shot, k=13) | Accuracy | 63.8 | #18 |
| Common Sense Reasoning | ARC (Easy) | FLAN 137B (zero-shot) | Accuracy | 79.6 | #14 |
| Common Sense Reasoning | ARC (Easy) | FLAN 137B (few-shot, k=14) | Accuracy | 80.7 | #10 |
| Question Answering | BoolQ | FLAN 137B (prompt-tuned) | Accuracy | 86.3 | #13 |
| Question Answering | BoolQ | FLAN 137B (few-shot, k=4) | Accuracy | 84.6 | #18 |
| Question Answering | BoolQ | FLAN 137B (zero-shot) | Accuracy | 82.9 | #23 |
| Question Answering | COPA | FLAN 137B (prompt-tuned) | Accuracy | 94 | #10 |
| Question Answering | COPA | FLAN 137B (few-shot, k=16) | Accuracy | 87 | #22 |
| Question Answering | COPA | FLAN 137B (zero-shot) | Accuracy | 91 | #13 |
| Sentence Completion | HellaSwag | FLAN 137B (few-shot, k=3) | Accuracy | 59.2 | #56 |
| Sentence Completion | HellaSwag | FLAN 137B (zero-shot) | Accuracy | 56.7 | #58 |
| Sentiment Analysis | IMDb | FLAN 137B (few-shot, k=2) | Accuracy | 95 | #17 |
| Sentiment Analysis | IMDb | FLAN 137B (zero-shot) | Accuracy | 94.3 | #21 |
| Question Answering | MultiRC | FLAN 137B (zero-shot) | F1 | 77.5 | #12 |
| Question Answering | MultiRC | FLAN 137B (prompt-tuned) | F1 | 83.4 | #11 |
| Question Answering | MultiRC | FLAN 137B (few-shot, k=1) | F1 | 72.1 | #14 |
| Question Answering | NaturalQA | FLAN 137B (zero-shot) | EM | 20.7 | #2 |
| Question Answering | OBQA | FLAN 137B (few-shot, k=16) | Accuracy | 78.2 | #2 |
| Question Answering | OBQA | FLAN 137B (zero-shot) | Accuracy | 78.4 | #1 |
| Question Answering | PIQA | FLAN 137B (zero-shot) | Accuracy | 80.5 | #26 |
| Question Answering | PIQA | FLAN 137B (few-shot, k=10) | Accuracy | 81.7 | #22 |
| Common Sense Reasoning | ReCoRD | FLAN 137B (prompt-tuned) | EM | 85.1 | #13 |
| Common Sense Reasoning | ReCoRD | FLAN 137B (zero-shot) | EM | 72.5 | #23 |
| Natural Language Inference | RTE | FLAN 137B (zero-shot) | Accuracy | 84.1 | #30 |
| Natural Language Inference | RTE | FLAN 137B (prompt-tuned) | Accuracy | 91.7 | #13 |
| Natural Language Inference | RTE | FLAN 137B (few-shot, k=8) | Accuracy | 84.5 | #29 |
| Question Answering | StoryCloze | FLAN 137B (few-shot, k=10) | Accuracy | 94.7 | #3 |
| Question Answering | StoryCloze | FLAN 137B (zero-shot) | Accuracy | 93.4 | #6 |
| Question Answering | TriviaQA | FLAN 137B (zero-shot) | EM | 56.7 | #32 |
| Coreference Resolution | Winograd Schema Challenge | FLAN 137B (zero-shot) | Accuracy | 80.8 | #20 |
| Coreference Resolution | Winograd Schema Challenge | FLAN 137B (prompt-tuned) | Accuracy | 86.5 | #15 |
| Common Sense Reasoning | WinoGrande | FLAN 137B (few-shot, k=16) | Accuracy | 72.8 | #30 |
| Common Sense Reasoning | WinoGrande | FLAN 137B (zero-shot) | Accuracy | 71.2 | #32 |
| Machine Translation | WMT2014 English-French | FLAN 137B (zero-shot) | BLEU score | 33.9 | #48 |
| Machine Translation | WMT2014 English-French | FLAN 137B (few-shot, k=9) | BLEU score | 33.8 | #49 |
| Machine Translation | WMT2014 French-English | FLAN 137B (zero-shot) | BLEU score | 35.9 | #2 |
| Machine Translation | WMT2014 French-English | FLAN 137B (few-shot, k=9) | BLEU score | 37.9 | #1 |
| Machine Translation | WMT2016 English-German | FLAN 137B (few-shot, k=11) | BLEU score | 26.1 | #7 |
| Machine Translation | WMT2016 English-German | FLAN 137B (zero-shot) | BLEU score | 27.0 | #5 |
| Machine Translation | WMT2016 English-Romanian | FLAN 137B (zero-shot) | BLEU score | 18.9 | #20 |
| Machine Translation | WMT2016 English-Romanian | FLAN 137B (few-shot, k=9) | BLEU score | 20.5 | #19 |
| Machine Translation | WMT2016 German-English | FLAN 137B (few-shot, k=11) | BLEU score | 40.7 | #1 |
| Machine Translation | WMT2016 German-English | FLAN 137B (zero-shot) | BLEU score | 38.9 | #2 |
| Machine Translation | WMT2016 Romanian-English | FLAN 137B (few-shot, k=9) | BLEU score | 38.1 | #2 |
| Machine Translation | WMT2016 Romanian-English | FLAN 137B (zero-shot) | BLEU score | 37.3 | #3 |
| Natural Language Inference | WNLI | FLAN 137B (zero-shot) | Accuracy | 74.6 | #14 |
| Natural Language Inference | WNLI | FLAN 137B (few-shot, k=4) | Accuracy | 70.4 | #17 |
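In the table above, "zero-shot" means the model receives only the instruction for the evaluation example, while "few-shot, k=n" prepends n solved exemplars of the same task to the prompt. The sketch below illustrates that distinction; the prompt formatting and example texts are assumptions for illustration, not the paper's exact evaluation prompts.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction.
# Formatting and example texts are illustrative assumptions, not the
# exact prompts used in the paper's evaluation.

def build_prompt(instruction: str,
                 exemplars: list[tuple[str, str]],
                 k: int = 0) -> str:
    """Prepend k solved (input, target) exemplars to the evaluation
    instruction; k=0 reproduces the zero-shot setting."""
    shots = [f"{inp}\n{target}" for inp, target in exemplars[:k]]
    return "\n\n".join(shots + [instruction])

exemplars = [
    ("Review: A joy from start to finish.\n"
     "Is this review positive or negative?", "positive"),
    ("Review: Two hours I will never get back.\n"
     "Is this review positive or negative?", "negative"),
]
query = ("Review: The plot drags but the acting shines.\n"
         "Is this review positive or negative?")

print(build_prompt(query, exemplars, k=0))  # zero-shot
print(build_prompt(query, exemplars, k=2))  # few-shot, k=2
```

For classification-style rows (Accuracy), the prediction is typically whichever answer option the model assigns the highest likelihood; generation-style rows (EM, BLEU score) score the model's free-form output.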
