Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

17 Sep 2019 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints...
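The model parallelism named in the title refers to splitting the weight matrices inside each transformer layer across GPUs rather than replicating the full model on every device. Below is a minimal sketch of one way to do this for a transformer MLP block, in the spirit of Megatron-LM's intra-layer approach; it assumes `torch.distributed` is already initialized with one process per GPU, and the class and parameter names are illustrative rather than the paper's actual code.

```python
# Sketch of intra-layer (tensor) model parallelism for a transformer MLP.
# Assumption: torch.distributed has been initialized with one process per GPU.
import torch
import torch.nn as nn
import torch.distributed as dist


class ParallelMLP(nn.Module):
    """Each rank holds a column slice of the first weight matrix and a
    row slice of the second, so the forward pass needs only one all-reduce."""

    def __init__(self, hidden: int, ffn: int, world_size: int):
        super().__init__()
        assert ffn % world_size == 0
        shard = ffn // world_size
        # Column-parallel first GEMM: output features are sharded across ranks.
        self.w1 = nn.Linear(hidden, shard)
        # Row-parallel second GEMM: input features are sharded across ranks.
        self.w2 = nn.Linear(shard, hidden, bias=False)
        # Shared output bias, applied after the partial sums are combined.
        self.bias = nn.Parameter(torch.zeros(hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.nn.functional.gelu(self.w1(x))  # local activation shard
        z = self.w2(y)                            # partial result on each rank
        dist.all_reduce(z)                        # sum partial results across GPUs
        return z + self.bias
```

Because the nonlinearity is applied independently to each shard, no communication is needed between the two GEMMs; the single all-reduce at the end is the only collective operation the block requires.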


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Reading Comprehension | RACE | Megatron-BERT (ensemble) | Accuracy | 90.9 | #1 |
| Reading Comprehension | RACE | Megatron-BERT (ensemble) | Accuracy (High) | 93.1 | #1 |
| Reading Comprehension | RACE | Megatron-BERT (ensemble) | Accuracy (Middle) | 90.0 | #1 |
| Reading Comprehension | RACE | Megatron-BERT | Accuracy | 89.5 | #3 |
| Reading Comprehension | RACE | Megatron-BERT | Accuracy (High) | 91.8 | #3 |
| Reading Comprehension | RACE | Megatron-BERT | Accuracy (Middle) | 88.6 | #4 |
| Language Modelling | WikiText-103 | Megatron-LM | Test perplexity | 10.81 | #1 |
| Language Modelling | WikiText-103 | Megatron-LM | Number of params | 8300M | #1 |

Methods used in the Paper