Finetuned Language Models Are Zero-Shot Learners

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially outperforms its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
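To make the method concrete, here is a minimal sketch of how a labeled NLI example can be verbalized with instruction templates. The template wording and helper names below are illustrative assumptions, not the paper's exact templates (the paper reports composing ten unique hand-written templates per dataset).

```python
# Minimal sketch of instruction-template verbalization for one NLI example.
# The two templates here are illustrative assumptions; FLAN uses multiple
# hand-written templates per dataset to increase instruction diversity.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: {options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? OPTIONS: {options}",
]

def verbalize(example: dict, template: str) -> str:
    """Render one labeled example as a natural-language instruction."""
    return template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=", ".join(example["options"]),
    )

example = {
    "premise": "A dog is running through the snow.",
    "hypothesis": "An animal is outside.",
    "options": ["yes", "it is not possible to tell", "no"],
}

for t in NLI_TEMPLATES:
    print(verbalize(example, t))
    print("---")
```

Each rendered string, paired with the gold label as the target text, becomes one finetuning example; mixing many datasets verbalized this way is what the paper calls instruction tuning.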

Published at ICLR 2022.

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | FLAN 137B (zero-shot) | Accuracy | 63.1 | #19 |
| Common Sense Reasoning | ARC (Challenge) | FLAN 137B (few-shot, k=13) | Accuracy | 63.8 | #18 |
| Common Sense Reasoning | ARC (Easy) | FLAN 137B (zero-shot) | Accuracy | 79.6 | #14 |
| Common Sense Reasoning | ARC (Easy) | FLAN 137B (few-shot, k=14) | Accuracy | 80.7 | #10 |
| Question Answering | BoolQ | FLAN 137B (prompt-tuned) | Accuracy | 86.3 | #13 |
| Question Answering | BoolQ | FLAN 137B (few-shot, k=4) | Accuracy | 84.6 | #18 |
| Question Answering | BoolQ | FLAN 137B (zero-shot) | Accuracy | 82.9 | #23 |
| Question Answering | COPA | FLAN 137B (prompt-tuned) | Accuracy | 94 | #10 |
| Question Answering | COPA | FLAN 137B (few-shot, k=16) | Accuracy | 87 | #22 |
| Question Answering | COPA | FLAN 137B (zero-shot) | Accuracy | 91 | #13 |
| Sentence Completion | HellaSwag | FLAN 137B (few-shot, k=3) | Accuracy | 59.2 | #56 |
| Sentence Completion | HellaSwag | FLAN 137B (zero-shot) | Accuracy | 56.7 | #58 |
| Sentiment Analysis | IMDb | FLAN 137B (few-shot, k=2) | Accuracy | 95 | #17 |
| Sentiment Analysis | IMDb | FLAN 137B (zero-shot) | Accuracy | 94.3 | #21 |
| Question Answering | MultiRC | FLAN 137B (zero-shot) | F1 | 77.5 | #12 |
| Question Answering | MultiRC | FLAN 137B (prompt-tuned) | F1 | 83.4 | #11 |
| Question Answering | MultiRC | FLAN 137B (few-shot, k=1) | F1 | 72.1 | #14 |
| Question Answering | NaturalQA | FLAN 137B (zero-shot) | EM | 20.7 | #2 |
| Question Answering | OBQA | FLAN 137B (few-shot, k=16) | Accuracy | 78.2 | #2 |
| Question Answering | OBQA | FLAN 137B (zero-shot) | Accuracy | 78.4 | #1 |
| Question Answering | PIQA | FLAN 137B (zero-shot) | Accuracy | 80.5 | #26 |
| Question Answering | PIQA | FLAN 137B (few-shot, k=10) | Accuracy | 81.7 | #22 |
| Common Sense Reasoning | ReCoRD | FLAN 137B (prompt-tuned) | EM | 85.1 | #13 |
| Common Sense Reasoning | ReCoRD | FLAN 137B (zero-shot) | EM | 72.5 | #23 |
| Natural Language Inference | RTE | FLAN 137B (zero-shot) | Accuracy | 84.1 | #30 |
| Natural Language Inference | RTE | FLAN 137B (prompt-tuned) | Accuracy | 91.7 | #13 |
| Natural Language Inference | RTE | FLAN 137B (few-shot, k=8) | Accuracy | 84.5 | #29 |
| Question Answering | StoryCloze | FLAN 137B (few-shot, k=10) | Accuracy | 94.7 | #3 |
| Question Answering | StoryCloze | FLAN 137B (zero-shot) | Accuracy | 93.4 | #6 |
| Question Answering | TriviaQA | FLAN 137B (zero-shot) | EM | 56.7 | #32 |
| Coreference Resolution | Winograd Schema Challenge | FLAN 137B (zero-shot) | Accuracy | 80.8 | #20 |
| Coreference Resolution | Winograd Schema Challenge | FLAN 137B (prompt-tuned) | Accuracy | 86.5 | #15 |
| Common Sense Reasoning | WinoGrande | FLAN 137B (few-shot, k=16) | Accuracy | 72.8 | #30 |
| Common Sense Reasoning | WinoGrande | FLAN 137B (zero-shot) | Accuracy | 71.2 | #32 |
| Machine Translation | WMT2014 English-French | FLAN 137B (zero-shot) | BLEU score | 33.9 | #48 |
| Machine Translation | WMT2014 English-French | FLAN 137B (few-shot, k=9) | BLEU score | 33.8 | #49 |
| Machine Translation | WMT2014 French-English | FLAN 137B (zero-shot) | BLEU score | 35.9 | #2 |
| Machine Translation | WMT2014 French-English | FLAN 137B (few-shot, k=9) | BLEU score | 37.9 | #1 |
| Machine Translation | WMT2016 English-German | FLAN 137B (few-shot, k=11) | BLEU score | 26.1 | #7 |
| Machine Translation | WMT2016 English-German | FLAN 137B (zero-shot) | BLEU score | 27.0 | #5 |
| Machine Translation | WMT2016 English-Romanian | FLAN 137B (zero-shot) | BLEU score | 18.9 | #20 |
| Machine Translation | WMT2016 English-Romanian | FLAN 137B (few-shot, k=9) | BLEU score | 20.5 | #19 |
| Machine Translation | WMT2016 German-English | FLAN 137B (few-shot, k=11) | BLEU score | 40.7 | #1 |
| Machine Translation | WMT2016 German-English | FLAN 137B (zero-shot) | BLEU score | 38.9 | #2 |
| Machine Translation | WMT2016 Romanian-English | FLAN 137B (few-shot, k=9) | BLEU score | 38.1 | #2 |
| Machine Translation | WMT2016 Romanian-English | FLAN 137B (zero-shot) | BLEU score | 37.3 | #3 |
| Natural Language Inference | WNLI | FLAN 137B (zero-shot) | Accuracy | 74.6 | #14 |
| Natural Language Inference | WNLI | FLAN 137B (few-shot, k=4) | Accuracy | 70.4 | #17 |
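In the table above, "zero-shot" means the model receives only the instruction for the evaluation example, while "few-shot, k=n" prepends n solved exemplars of the same task to the prompt. The sketch below illustrates that distinction; the prompt formatting and example texts are assumptions for illustration, not the paper's exact evaluation prompts.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction.
# Formatting and example texts are illustrative assumptions, not the
# exact prompts used in the paper's evaluation.

def build_prompt(instruction: str,
                 exemplars: list[tuple[str, str]],
                 k: int = 0) -> str:
    """Prepend k solved (input, target) exemplars to the evaluation
    instruction; k=0 reproduces the zero-shot setting."""
    shots = [f"{inp}\n{target}" for inp, target in exemplars[:k]]
    return "\n\n".join(shots + [instruction])

exemplars = [
    ("Review: A joy from start to finish.\n"
     "Is this review positive or negative?", "positive"),
    ("Review: Two hours I will never get back.\n"
     "Is this review positive or negative?", "negative"),
]
query = ("Review: The plot drags but the acting shines.\n"
         "Is this review positive or negative?")

print(build_prompt(query, exemplars, k=0))  # zero-shot
print(build_prompt(query, exemplars, k=2))  # few-shot, k=2
```

For classification-style rows (Accuracy), the prediction is typically whichever answer option the model assigns the highest likelihood; generation-style rows (EM, BLEU score) score the model's free-form output.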
