ST-MoE: Designing Stable and Transferable Sparse Expert Models

17 Feb 2022  ·  Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. But advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
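The abstract describes sparse expert models, in which a learned gate routes each token to a small subset of expert sub-networks so that compute stays roughly constant as parameter count grows. As a rough illustration only (the abstract does not specify the routing scheme, so the top-2 gating, function names, and shapes below are assumptions in the style of common MoE layers, not the paper's implementation), here is a minimal sketch in NumPy:

```python
import numpy as np

def top2_gate(logits):
    """Softmax gate that keeps only the two highest-scoring experts per token.

    logits: [tokens, n_experts] router scores.
    Returns mixing weights of the same shape; exactly two entries per row
    are nonzero, renormalized to sum to 1.
    """
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    top2 = np.argsort(logits, axis=-1)[:, -2:]           # indices of the two best experts
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    gated = probs * mask
    return gated / gated.sum(axis=-1, keepdims=True)     # renormalize over the kept pair

def moe_layer(x, w_gate, experts):
    """Route each token to its top-2 experts and mix their outputs.

    x: [tokens, d] activations; w_gate: [d, n_experts] router weights;
    experts: list of callables, each mapping [tokens, d] -> [tokens, d].
    (A dense sketch: a real implementation dispatches each token only to
    its selected experts rather than running all of them.)
    """
    weights = top2_gate(x @ w_gate)                       # [tokens, n_experts]
    outputs = np.stack([f(x) for f in experts], axis=1)   # [tokens, n_experts, d]
    return np.einsum('te,ted->td', weights, outputs)
```

Because only two of the n_experts weights are nonzero per token, the effective per-token compute in a dispatched implementation scales with 2, not n_experts, which is the source of the "269B parameters at ~32B-dense cost" trade-off the abstract describes.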


Results from the Paper


| Task                       | Dataset                    | Model                        | Metric   | Value | Global Rank |
|----------------------------|----------------------------|------------------------------|----------|-------|-------------|
| Common Sense Reasoning     | ARC (Challenge)            | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 56.9  | # 22        |
| Common Sense Reasoning     | ARC (Challenge)            | ST-MoE-32B 269B (fine-tuned) | Accuracy | 86.5  | # 10        |
| Common Sense Reasoning     | ARC (Easy)                 | ST-MoE-32B 269B (fine-tuned) | Accuracy | 95.2  | # 1         |
| Common Sense Reasoning     | ARC (Easy)                 | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 75.4  | # 19        |
| Question Answering         | BoolQ                      | ST-MoE-32B 269B (fine-tuned) | Accuracy | 92.4  | # 1         |
| Question Answering         | BoolQ                      | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 88.6  | # 9         |
| Natural Language Inference | CommitmentBank             | ST-MoE-32B 269B (fine-tuned) | Accuracy | 98    | # 4         |
| Natural Language Inference | CommitmentBank             | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 98.2  | # 3         |
| Question Answering         | COPA                       | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 91    | # 13        |
| Question Answering         | COPA                       | ST-MoE-32B 269B (fine-tuned) | Accuracy | 99.2  | # 3         |
| Question Answering         | MultiRC                    | ST-MoE-32B 269B (fine-tuned) | F1       | 89.6  | # 2         |
| Question Answering         | MultiRC                    | ST-MoE-L 4.1B (fine-tuned)   | F1       | 86    | # 8         |
| Common Sense Reasoning     | ReCoRD                     | ST-MoE-32B 269B (fine-tuned) | EM       | 95.1  | # 2         |
| Common Sense Reasoning     | ReCoRD                     | ST-MoE-L 4.1B (fine-tuned)   | EM       | 88.9  | # 12        |
| Natural Language Inference | RTE                        | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 92.1% | # 10        |
| Natural Language Inference | RTE                        | ST-MoE-32B 269B (fine-tuned) | Accuracy | 93.5% | # 4         |
| Coreference Resolution     | Winograd Schema Challenge  | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 93.3  | # 8         |
| Coreference Resolution     | Winograd Schema Challenge  | ST-MoE-32B 269B (fine-tuned) | Accuracy | 96.6  | # 5         |
| Common Sense Reasoning     | WinoGrande                 | ST-MoE-32B 269B (fine-tuned) | Accuracy | 96.1  | # 1         |
| Common Sense Reasoning     | WinoGrande                 | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 81.7  | # 10        |
| Word Sense Disambiguation  | Words in Context           | ST-MoE-32B 269B (fine-tuned) | Accuracy | 77.7  | # 3         |
| Word Sense Disambiguation  | Words in Context           | ST-MoE-L 4.1B (fine-tuned)   | Accuracy | 74    | # 10        |

Methods