TinyBERT: Distilling BERT for Natural Language Understanding

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plentiful knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERTBASE.
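The Transformer distillation objective described above can be sketched in a few lines. This is a minimal single-layer illustration, not the paper's full implementation: the function names, array shapes, and equal loss weights are assumptions, and the actual method sums these terms over a student-to-teacher layer mapping and also distills the embedding layer.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of student log-probs against softened teacher probs."""
    p_teacher = np.exp(log_softmax(teacher_logits / temperature))
    return float(-(p_teacher * log_softmax(student_logits / temperature))
                 .sum(axis=-1).mean())

def transformer_distillation_loss(s_attn, t_attn, s_hidden, t_hidden,
                                  W_proj, s_logits, t_logits):
    """Illustrative one-layer Transformer-distillation loss:
    attention-map matching + hidden-state matching (W_proj is a learned
    projection bridging the student/teacher hidden-width mismatch)
    + soft matching of the prediction-layer logits."""
    attn_loss = mse(s_attn, t_attn)                  # attention matrices
    hidden_loss = mse(s_hidden @ W_proj, t_hidden)   # hidden states
    pred_loss = soft_cross_entropy(s_logits, t_logits)
    return attn_loss + hidden_loss + pred_loss
```

In the two-stage framework, the same objective is applied twice: first against a general-domain teacher during pre-training distillation, then against a fine-tuned teacher during task-specific distillation.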

Findings of 2020
| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Linguistic Acceptability | CoLA | TinyBERT-4 14.5M | Accuracy | 43.3% | #40 |
| Linguistic Acceptability | CoLA Dev | TinyBERT-6 67M | Accuracy | 54 | #4 |
| Semantic Textual Similarity | MRPC | TinyBERT-6 67M | Accuracy | 87.3% | #29 |
| Semantic Textual Similarity | MRPC | TinyBERT-4 14.5M | Accuracy | 86.4% | #32 |
| Semantic Textual Similarity | MRPC Dev | TinyBERT-6 67M | Accuracy | 86.3 | #2 |
| Natural Language Inference | MultiNLI | TinyBERT-4 14.5M | Matched | 82.5 | #35 |
| Natural Language Inference | MultiNLI | TinyBERT-4 14.5M | Mismatched | 81.8 | #26 |
| Natural Language Inference | MultiNLI | TinyBERT-6 67M | Matched | 84.6 | #29 |
| Natural Language Inference | MultiNLI | TinyBERT-6 67M | Mismatched | 83.2 | #23 |
| Natural Language Inference | MultiNLI Dev | TinyBERT-6 67M | Matched | 84.5 | #1 |
| Natural Language Inference | MultiNLI Dev | TinyBERT-6 67M | Mismatched | 84.5 | #1 |
| Natural Language Inference | QNLI | TinyBERT-6 67M | Accuracy | 90.4% | #34 |
| Natural Language Inference | QNLI | TinyBERT-4 14.5M | Accuracy | 87.7% | #39 |
| Paraphrase Identification | Quora Question Pairs | TinyBERT | F1 | 71.3 | #14 |
| Natural Language Inference | RTE | TinyBERT-6 67M | Accuracy | 66% | #64 |
| Natural Language Inference | RTE | TinyBERT-4 14.5M | Accuracy | 62.9% | #68 |
| Question Answering | SQuAD1.1 dev | TinyBERT-6 67M | EM | 79.7 | #15 |
| Question Answering | SQuAD1.1 dev | TinyBERT-6 67M | F1 | 87.5 | #16 |
| Question Answering | SQuAD2.0 dev | TinyBERT-6 67M | F1 | 73.4 | #13 |
| Question Answering | SQuAD2.0 dev | TinyBERT-6 67M | EM | 69.9 | #12 |
| Sentiment Analysis | SST-2 Binary classification | TinyBERT-4 14.5M | Accuracy | 92.6 | #46 |
| Sentiment Analysis | SST-2 Binary classification | TinyBERT-6 67M | Accuracy | 93.1 | #42 |
| Semantic Textual Similarity | STS Benchmark | TinyBERT-4 14.5M | Pearson Correlation | 0.799 | #28 |
