TinyBERT: Distilling BERT for Natural Language Understanding

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plentiful knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERTBASE.
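The Transformer distillation objective described above can be sketched in a few lines. This is a minimal single-layer illustration, not the paper's full implementation: the function names, array shapes, and equal loss weights are assumptions, and the actual method sums these terms over a student-to-teacher layer mapping and also distills the embedding layer.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of student log-probs against softened teacher probs."""
    p_teacher = np.exp(log_softmax(teacher_logits / temperature))
    return float(-(p_teacher * log_softmax(student_logits / temperature))
                 .sum(axis=-1).mean())

def transformer_distillation_loss(s_attn, t_attn, s_hidden, t_hidden,
                                  W_proj, s_logits, t_logits):
    """Illustrative one-layer Transformer-distillation loss:
    attention-map matching + hidden-state matching (W_proj is a learned
    projection bridging the student/teacher hidden-width mismatch)
    + soft matching of the prediction-layer logits."""
    attn_loss = mse(s_attn, t_attn)                  # attention matrices
    hidden_loss = mse(s_hidden @ W_proj, t_hidden)   # hidden states
    pred_loss = soft_cross_entropy(s_logits, t_logits)
    return attn_loss + hidden_loss + pred_loss
```

In the two-stage framework, the same objective is applied twice: first against a general-domain teacher during pre-training distillation, then against a fine-tuned teacher during task-specific distillation.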

Findings of 2020
| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Linguistic Acceptability | CoLA | TinyBERT-4 14.5M | Accuracy | 43.3% | #40 |
| Linguistic Acceptability | CoLA Dev | TinyBERT-6 67M | Accuracy | 54 | #4 |
| Semantic Textual Similarity | MRPC | TinyBERT-6 67M | Accuracy | 87.3% | #29 |
| Semantic Textual Similarity | MRPC | TinyBERT-4 14.5M | Accuracy | 86.4% | #32 |
| Semantic Textual Similarity | MRPC Dev | TinyBERT-6 67M | Accuracy | 86.3 | #2 |
| Natural Language Inference | MultiNLI | TinyBERT-4 14.5M | Matched | 82.5 | #35 |
| Natural Language Inference | MultiNLI | TinyBERT-4 14.5M | Mismatched | 81.8 | #26 |
| Natural Language Inference | MultiNLI | TinyBERT-6 67M | Matched | 84.6 | #29 |
| Natural Language Inference | MultiNLI | TinyBERT-6 67M | Mismatched | 83.2 | #23 |
| Natural Language Inference | MultiNLI Dev | TinyBERT-6 67M | Matched | 84.5 | #1 |
| Natural Language Inference | MultiNLI Dev | TinyBERT-6 67M | Mismatched | 84.5 | #1 |
| Natural Language Inference | QNLI | TinyBERT-6 67M | Accuracy | 90.4% | #34 |
| Natural Language Inference | QNLI | TinyBERT-4 14.5M | Accuracy | 87.7% | #39 |
| Paraphrase Identification | Quora Question Pairs | TinyBERT | F1 | 71.3 | #14 |
| Natural Language Inference | RTE | TinyBERT-6 67M | Accuracy | 66% | #64 |
| Natural Language Inference | RTE | TinyBERT-4 14.5M | Accuracy | 62.9% | #68 |
| Question Answering | SQuAD1.1 dev | TinyBERT-6 67M | EM | 79.7 | #15 |
| Question Answering | SQuAD1.1 dev | TinyBERT-6 67M | F1 | 87.5 | #16 |
| Question Answering | SQuAD2.0 dev | TinyBERT-6 67M | F1 | 73.4 | #13 |
| Question Answering | SQuAD2.0 dev | TinyBERT-6 67M | EM | 69.9 | #12 |
| Sentiment Analysis | SST-2 Binary classification | TinyBERT-4 14.5M | Accuracy | 92.6 | #46 |
| Sentiment Analysis | SST-2 Binary classification | TinyBERT-6 67M | Accuracy | 93.1 | #42 |
| Semantic Textual Similarity | STS Benchmark | TinyBERT-4 14.5M | Pearson Correlation | 0.799 | #28 |
