Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
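The "purely via text interaction" setup described above can be illustrated with a short sketch. This is not the paper's exact prompt format; the helper function, demonstration strings, and arithmetic task below are illustrative assumptions showing how few-shot demonstrations and a new query are concatenated into a single prompt that a language model then completes, with no gradient updates.

```python
# Illustrative sketch of few-shot (in-context) prompting: the task is
# specified purely as text -- a few demonstrations followed by a new query --
# and the model is asked to continue the text. No fine-tuning is involved.
# The Q:/A: format and the 3-digit arithmetic demonstrations are
# hypothetical examples, not taken from the paper's actual prompts.

def build_few_shot_prompt(demonstrations, query):
    """Concatenate (question, answer) demonstrations and a final query."""
    lines = []
    for question, answer in demonstrations:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model completes the text from this point
    return "\n".join(lines)

demos = [
    ("What is 248 plus 351?", "599"),
    ("What is 914 minus 207?", "707"),
]
prompt = build_few_shot_prompt(demos, "What is 123 plus 456?")
print(prompt)
```

In the zero-shot setting the demonstration list would simply be empty, and in the one-shot setting it would contain a single example; the paper's few-shot results typically use as many demonstrations as fit in the model's context window (e.g. k=32 in the table below).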

NeurIPS 2020

Results from the Paper


Ranked #1 on Question Answering on CoQA (Overall metric)
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Natural Language Inference | ANLI test | GPT-3 | A1 | 36.8 | #13 |
| Natural Language Inference | ANLI test | GPT-3 | A2 | 34 | #23 |
| Natural Language Inference | ANLI test | GPT-3 | A3 | 40.2 | #18 |
| Common Sense Reasoning | ARC (Challenge) | GPT-3 175B (1-shot) | Accuracy | 53.2 | #25 |
| Common Sense Reasoning | ARC (Challenge) | GPT-3 175B (0-shot) | Accuracy | 51.4 | #27 |
| Common Sense Reasoning | ARC (Easy) | GPT-3 175B (1-shot) | Accuracy | 71.2 | #25 |
| Common Sense Reasoning | ARC (Easy) | GPT-3 175B (0-shot) | Accuracy | 68.8 | #31 |
| Question Answering | BoolQ | GPT-3 175B (0-shot) | Accuracy | 60.5 | #50 |
| Question Answering | BoolQ | GPT-3 175B (few-shot, k=32) | Accuracy | 76.4 | #31 |
| Natural Language Inference | CommitmentBank | GPT-3 175B (few-shot, k=32) | F1 | 52 | #9 |
| Natural Language Inference | CommitmentBank | GPT-3 175B (Few-Shot) | Accuracy | 75.6 | #13 |
| Question Answering | COPA | GPT-3 Large 760M (0-shot) | Accuracy | 73.0 | #46 |
| Question Answering | COPA | GPT-3 175B (few-shot, k=32) | Accuracy | 92 | #11 |
| Question Answering | COPA | GPT-3 13B (few-shot, k=32) | Accuracy | 86 | #25 |
| Question Answering | COPA | GPT-3 175B (1-shot) | Accuracy | 87 | #22 |
| Question Answering | COPA | GPT-3 175B (0-shot) | Accuracy | 91 | #13 |
| Question Answering | CoQA | GPT-3 175B (few-shot, k=32) | Overall | 85 | #1 |
| Question Answering | DROP Test | GPT-3 175B (few-shot, k=32) | F1 | 36.5 | #15 |
| Sentence Completion | HellaSwag | GPT-3 Large 760M (0-shot) | Accuracy | 51.0 | #60 |
| Sentence Completion | HellaSwag | GPT-3 (0-shot) | Accuracy | 78.9 | #43 |
| Sentence Completion | HellaSwag | GPT-3 175B (few-shot, k=32) | Accuracy | 79.3 | #40 |
| Language Modelling | LAMBADA | GPT-3 13B (Zero-Shot) | Accuracy | 72.5 | #20 |
| Language Modelling | LAMBADA | GPT-3 13B (Zero-Shot) | Perplexity | 3.56 | #3 |
| Language Modelling | LAMBADA | GPT-3 2.7B (Zero-Shot) | Accuracy | 67.1 | #28 |
| Language Modelling | LAMBADA | GPT-3 2.7B (Zero-Shot) | Perplexity | 4.60 | #9 |
| Language Modelling | LAMBADA | GPT-3 6.7B (Zero-Shot) | Accuracy | 70.3 | #23 |
| Language Modelling | LAMBADA | GPT-3 6.7B (Zero-Shot) | Perplexity | 4.00 | #6 |
| Language Modelling | LAMBADA | GPT-3 175B (Zero-Shot) | Accuracy | 76.2 | #18 |
| Language Modelling | LAMBADA | GPT-3 175B (Zero-Shot) | Perplexity | 3.00 | #2 |
| Language Modelling | LAMBADA | GPT-3 175B (Few-Shot) | Accuracy | 86.4 | #3 |
| Language Modelling | LAMBADA | GPT-3 175B (Few-Shot) | Perplexity | 1.92 | #1 |
| Multi-task Language Understanding | MMLU | GPT-3 175B (5-shot) | Average (%) | 43.9 | #68 |
| Multi-task Language Understanding | MMLU | GPT-3 2.7B (5-shot) | Average (%) | 25.9 | #94 |
| Multi-task Language Understanding | MMLU | GPT-3 6.7B (5-shot) | Average (%) | 24.9 | #98 |
| Multi-task Language Understanding | MMLU | GPT-3 13B (few-shot, k=32) | Average (%) | 26 | #92 |
| Question Answering | MultiRC | GPT-3 175B (Few-Shot) | F1 | 75.4 | #13 |
| Question Answering | Natural Questions | GPT-3 175B (Few-Shot, k=64) | EM | 29.9 | #26 |
| Question Answering | OBQA | GPT-3 175B (zero-shot) | Accuracy | 57.6 | #5 |
| Question Answering | OpenBookQA | GPT-3 175B (few-shot, k=32) | Accuracy | 65.4 | #25 |
| Language Modelling | Penn Treebank (Word Level) | GPT-3 (Zero-Shot) | Test perplexity | 20.5 | #1 |
| Language Modelling | Penn Treebank (Word Level) | GPT-3 (Zero-Shot) | Params | 175000M | #1 |
| Question Answering | PIQA | GPT-3 175B (0-shot) | Accuracy | 81.0 | #24 |
| Question Answering | PIQA | GPT-3 Large 760M (0-shot) | Accuracy | 72.9 | #44 |
| Question Answering | QuAC | GPT-3 175B (few-shot, k=32) | F1 | 44.3 | #2 |
| Reading Comprehension | RACE | GPT-3 175B (zero-shot) | Accuracy (High) | 45.5 | #13 |
| Reading Comprehension | RACE | GPT-3 175B (0-shot) | Accuracy (Middle) | 58.4 | #13 |
| Question Answering | RACE | GPT-3 175B (Few-Shot) | RACE-h | 46.8 | #5 |
| Question Answering | RACE | GPT-3 175B (few-shot, k=32) | RACE-m | 58.1 | #6 |
| Common Sense Reasoning | ReCoRD | GPT-3 Large 760M (0-shot) | EM | 82.1 | #15 |
| Natural Language Inference | RTE | GPT-3 175B (few-shot, k=32) | Accuracy | 69 | #57 |
| Question Answering | StoryCloze | GPT-3 Large 760M (zero-shot) | Accuracy | 72.4 | #19 |
| Question Answering | Story Cloze | GPT-3 175B (Few-Shot) | Accuracy | 87.7 | #2 |
| Question Answering | TriviaQA | GPT-3 175B (Few-Shot) | EM | 71.2 | #23 |
| Question Answering | WebQuestions | GPT-3-175B (One-Shot) | EM | 25.3 | #13 |
| Question Answering | WebQuestions | GPT-3-175B (Few-Shot) | EM | 41.5 | #8 |
| Question Answering | WebQuestions | GPT-3-175B (Zero-Shot) | EM | 14.4 | #17 |
| Coreference Resolution | Winograd Schema Challenge | GPT-3 175B (few-shot) | Accuracy | 80.1 | #21 |
| Common Sense Reasoning | WinoGrande | GPT-3 175B (0-shot) | Accuracy | 70.2 | #36 |
| Common Sense Reasoning | WinoGrande | GPT-3 Large 760M (0-shot) | Accuracy | 57.4 | #51 |
| Unsupervised Machine Translation | WMT2014 English-French | GPT-3 175B (Few-Shot) | BLEU | 32.6 | #5 |
| Unsupervised Machine Translation | WMT2014 French-English | GPT-3 175B (Few-Shot) | BLEU | 39.2 | #1 |
| Unsupervised Machine Translation | WMT2016 English-German | GPT-3 175B (Few-Shot) | BLEU | 29.7 | #1 |
| Unsupervised Machine Translation | WMT2016 English-Romanian | GPT-3 175B (Few-Shot) | BLEU | 21 | #1 |
| Unsupervised Machine Translation | WMT2016 German-English | GPT-3 175B (Few-Shot) | BLEU | 40.6 | #1 |
| Unsupervised Machine Translation | WMT2016 Romanian-English | GPT-3 175B (Few-Shot) | BLEU | 39.5 | #1 |
| Word Sense Disambiguation | Words in Context | GPT-3 175B (few-shot, k=32) | Accuracy | 49.4 | #36 |