ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often improves performance on downstream tasks. At some point, however, further increasing the model size becomes harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques that lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better than the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
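The two parameter-reduction techniques described in the paper are factorized embedding parameterization (splitting the large vocabulary-embedding matrix into two smaller matrices) and cross-layer parameter sharing. Below is a minimal PyTorch sketch of both ideas, not the official implementation; the sizes are illustrative and the class names are ours.

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Factorize the V x H embedding table into V x E plus E x H (E << H)."""
    def __init__(self, vocab_size=30000, embed_size=128, hidden_size=4096):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)  # V x E
        self.projection = nn.Linear(embed_size, hidden_size)         # E x H

    def forward(self, input_ids):
        # With V=30000, E=128, H=4096: ~4.4M parameters instead of ~123M.
        return self.projection(self.word_embeddings(input_ids))

class SharedEncoder(nn.Module):
    """Cross-layer parameter sharing: one Transformer block applied L times."""
    def __init__(self, hidden_size=4096, num_heads=64, num_layers=12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, hidden_states):
        for _ in range(self.num_layers):  # reuse the same weights each pass
            hidden_states = self.block(hidden_states)
        return hidden_states
```

In practice, the pretrained checkpoints at the linked repository are the route to the reported numbers; the sketch only shows why the parameter count drops.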

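The inter-sentence coherence loss is sentence-order prediction (SOP): positives are two consecutive segments from the same document, and negatives are the same two segments with their order swapped, so the model must learn discourse coherence rather than topic match (which suffices for BERT's next-sentence prediction). A minimal sketch of the example construction, assuming `segments` holds consecutive text segments from one document (the helper name is ours):

```python
import random

def make_sop_example(segments, i):
    """Build one sentence-order-prediction pair from segments i and i+1."""
    first, second = segments[i], segments[i + 1]
    if random.random() < 0.5:
        return (first, second), 1  # positive: segments in original order
    return (second, first), 0      # negative: same segments, order swapped
```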

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | ALBERT | Accuracy | 69.1% | #10 |
| Multi-task Language Understanding | MMLU | ALBERT-xxlarge 223M (fine-tuned) | Average (%) | 27.1 | #92 |
| Semantic Textual Similarity | MRPC | ALBERT | Accuracy | 93.4% | #2 |
| Natural Language Inference | MultiNLI | ALBERT | Matched | 91.3 | #5 |
| Multimodal Intent Recognition | PhotoChat | ALBERT-base | F1 | 52.2 | #6 |
| Multimodal Intent Recognition | PhotoChat | ALBERT-base | Precision | 44.8 | #6 |
| Multimodal Intent Recognition | PhotoChat | ALBERT-base | Recall | 62.7 | #3 |
| Natural Language Inference | QNLI | ALBERT | Accuracy | 99.2% | #1 |
| Question Answering | Quora Question Pairs | ALBERT | Accuracy | 90.5% | #3 |
| Natural Language Inference | RTE | ALBERT | Accuracy | 89.2% | #16 |
| Question Answering | SQuAD2.0 | ALBERT (single model) | EM | 88.107 | #64 |
| Question Answering | SQuAD2.0 | ALBERT (single model) | F1 | 90.902 | #67 |
| Question Answering | SQuAD2.0 | ALBERT (ensemble model) | EM | 89.731 | #27 |
| Question Answering | SQuAD2.0 | ALBERT (ensemble model) | F1 | 92.215 | #28 |
| Question Answering | SQuAD2.0 dev | ALBERT xxlarge | F1 | 88.1 | #4 |
| Question Answering | SQuAD2.0 dev | ALBERT xxlarge | EM | 85.1 | #4 |
| Question Answering | SQuAD2.0 dev | ALBERT xlarge | F1 | 85.9 | #7 |
| Question Answering | SQuAD2.0 dev | ALBERT xlarge | EM | 83.1 | #6 |
| Question Answering | SQuAD2.0 dev | ALBERT base | F1 | 79.1 | #10 |
| Question Answering | SQuAD2.0 dev | ALBERT base | EM | 76.1 | #9 |
| Question Answering | SQuAD2.0 dev | ALBERT large | F1 | 82.1 | #9 |
| Question Answering | SQuAD2.0 dev | ALBERT large | EM | 79.0 | #8 |
| Sentiment Analysis | SST-2 Binary classification | ALBERT | Accuracy | 97.1 | #5 |
| Semantic Textual Similarity | STS Benchmark | ALBERT | Pearson Correlation | 0.925 | #4 |
| Natural Language Inference | WNLI | ALBERT | Accuracy | 91.8 | #5 |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank | Source |
|---|---|---|---|---|---|---|
| Common Sense Reasoning | CommonsenseQA | ALBERT (ensemble) | Accuracy | 76.5 | #10 | Lan et al. (2020) |