DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

NeurIPS 2019 · Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts...
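
The paper's distillation signal combines a temperature-scaled soft-target loss against the teacher's output distribution with the usual masked-language-modeling loss and a cosine embedding loss that aligns teacher and student hidden states. The sketch below illustrates only the soft-target term in PyTorch; it is an illustrative reconstruction, not the authors' training code, and the tensor shapes and temperature value are arbitrary placeholders.

```python
# Minimal sketch of temperature-scaled knowledge distillation (soft-target term only).
# Not the authors' code: the full recipe also adds an MLM loss on masked positions
# and a cosine embedding loss between teacher and student hidden states.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    # Softening with t > 1 exposes the teacher's relative probabilities over
    # incorrect classes; the t**2 factor keeps gradient magnitudes comparable
    # across temperature settings.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

if __name__ == "__main__":
    # Toy example: 4 token positions over a 30k-entry vocabulary (shapes are illustrative).
    teacher_logits = torch.randn(4, 30000)
    student_logits = torch.randn(4, 30000, requires_grad=True)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In the full training objective this term would be summed with the masked-language-modeling loss and the cosine loss between the two models' hidden states.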

TASK                         DATASET                      MODEL       METRIC               VALUE   GLOBAL RANK
Linguistic Acceptability     CoLA                         DistilBERT  Accuracy             49.1%   #15
Sentiment Analysis           IMDb                         DistilBERT  Accuracy             92.82   #9
Semantic Textual Similarity  MRPC                         DistilBERT  Accuracy             90.2%   #6
Natural Language Inference   QNLI                         DistilBERT  Accuracy             90.2%   #15
Question Answering           Quora Question Pairs         DistilBERT  Accuracy             89.2%   #10
Natural Language Inference   RTE                          DistilBERT  Accuracy             62.9%   #17
Question Answering           SQuAD1.1 (dev)               DistilBERT  EM                   77.7    #13
Question Answering           SQuAD1.1 (dev)               DistilBERT  F1                   85.8    #15
Sentiment Analysis           SST-2 Binary classification  DistilBERT  Accuracy             91.3    #19
Semantic Textual Similarity  STS Benchmark                DistilBERT  Pearson Correlation  0.907   #6
Natural Language Inference   WNLI                         DistilBERT  Accuracy             44.4%   #9

Methods used in the Paper