DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train, and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
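
The triple loss described in the abstract can be illustrated with a short sketch. The following PyTorch snippet is not the authors' released implementation; the loss weights, temperature, and tensor shapes are illustrative assumptions, but the structure (hard-target masked LM loss, soft-target distillation loss, and a cosine term on hidden states) follows the description above.

```python
# Minimal sketch of a triple loss combining masked language modeling,
# soft-target distillation, and a cosine-distance term between teacher and
# student hidden states. Weights and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_triple_loss(
    student_logits,   # (batch, seq_len, vocab) student MLM logits
    teacher_logits,   # (batch, seq_len, vocab) teacher MLM logits
    student_hidden,   # (batch, seq_len, dim)   student last hidden states
    teacher_hidden,   # (batch, seq_len, dim)   teacher last hidden states
    mlm_labels,       # (batch, seq_len), -100 marks non-masked positions
    temperature=2.0,
    alpha_ce=5.0,     # weight of the soft-target distillation loss
    alpha_mlm=2.0,    # weight of the hard-target masked LM loss
    alpha_cos=1.0,    # weight of the cosine embedding loss
):
    vocab_size = student_logits.size(-1)

    # 1) Hard-target masked language modeling loss on the masked tokens.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 2) Soft-target distillation loss: KL divergence between the softened
    #    teacher and student output distributions, scaled by T^2 as usual.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 3) Cosine embedding loss aligning the directions of the student's and
    #    teacher's hidden-state vectors.
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1),
        device=student_hidden.device,
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```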

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | DistilBERT | Accuracy | 49.1% | # 34 |
| Sentiment Analysis | IMDb | DistilBERT | Accuracy | 92.82 | # 28 |
| Semantic Textual Similarity | MRPC | DistilBERT | Accuracy | 90.2% | # 14 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | DistilBERT (BASE) | Wasserstein Distance (WD) | 86.7 ± .6 | # 17 |
| | | | # Correct Groups | 49 ± 4 | # 19 |
| | | | Fowlkes Mallows Score (FMS) | 29.1 ± .2 | # 18 |
| | | | Adjusted Rand Index (ARI) | 11.3 ± .3 | # 18 |
| | | | Adjusted Mutual Information (AMI) | 14.0 ± .3 | # 18 |
| | | | # Solved Walls | 0 ± 0 | # 10 |
| Natural Language Inference | QNLI | DistilBERT | Accuracy | 90.2% | # 35 |
| Question Answering | Quora Question Pairs | DistilBERT | Accuracy | 89.2% | # 13 |
| Natural Language Inference | RTE | DistilBERT | Accuracy | 62.9% | # 66 |
| Question Answering | SQuAD1.1 dev | DistilBERT | EM | 77.7 | # 20 |
| | | | F1 | 85.8 | # 22 |
| Sentiment Analysis | SST-2 Binary classification | DistilBERT | Accuracy | 91.3 | # 53 |
| Semantic Textual Similarity | STS Benchmark | DistilBERT | Pearson Correlation | 0.907 | # 16 |
| Natural Language Inference | WNLI | DistilBERT | Accuracy | 44.4 | # 23 |
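
Results such as the sentiment and inference numbers above are obtained by fine-tuning the pre-trained DistilBERT checkpoint on each downstream task. The snippet below is a minimal inference sketch using the Hugging Face Transformers library; the SST-2 fine-tuned checkpoint name is an assumption about what is published on the model hub, not part of the paper.

```python
# Minimal sketch: sentiment classification with a fine-tuned DistilBERT
# checkpoint via the Transformers library (assumed installed).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed hub name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its label, e.g. "POSITIVE".
print(model.config.id2label[logits.argmax(dim=-1).item()])
```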

Methods