BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Published at NAACL 2019 (PDF and abstract available).
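As a rough illustration of the abstract's "just one additional output layer" fine-tuning recipe, the sketch below stacks a single linear classifier on the pooled [CLS] representation of a pre-trained BERT encoder. It uses the Hugging Face `transformers` library, which is not part of the original paper (the authors released TensorFlow code); the checkpoint name, label count, and example sentence are placeholder assumptions.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer


class BertClassifier(nn.Module):
    """A pre-trained BERT encoder plus one task-specific output layer,
    per the fine-tuning setup described in the abstract (details here
    are illustrative, not the paper's exact configuration)."""

    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The single added layer: a linear classifier over the pooled
        # [CLS] representation (hidden size 768 for BERT-Base).
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)
inputs = tokenizer("BERT is conceptually simple and empirically powerful.",
                   return_tensors="pt")
logits = model(inputs["input_ids"], inputs["attention_mask"])
print(logits.shape)  # torch.Size([1, 2])
```

During fine-tuning, the encoder weights and the new output layer are trained jointly, which is why the tasks below require no substantial task-specific architecture changes.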
| Task | Dataset | Model | Metric | Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Stock Market Prediction | Astock | Bert Chinese | Accuracy | 59.11 | # 15 |
| | | | F1-score | 58.99 | # 15 |
| | | | Recall | 59.20 | # 15 |
| | | | Precision | 59.07 | # 15 |
| Linguistic Acceptability | CoLA | BERT-LARGE | Accuracy | 60.5% | # 25 |
| Named Entity Recognition | CoNLL 2003 (English) | BERT-BASE | F1 | 92.4 | # 2 |
| Named Entity Recognition | CoNLL 2003 (English) | BERT-LARGE | F1 | 92.8 | # 1 |
| Question Answering | CoQA | BERT-base finetune (single model) | In-domain | 79.8 | # 2 |
| | | | Out-of-domain | 74.1 | # 2 |
| | | | Overall | 78.1 | # 4 |
| Question Answering | CoQA | BERT Large Augmented (single model) | In-domain | 82.5 | # 1 |
| | | | Out-of-domain | 77.6 | # 1 |
| | | | Overall | 81.1 | # 2 |
| Emotion Recognition in Conversation | CPED | BERT_{utt} | Accuracy of Sentiment | 48.96 | # 5 |
| | | | Macro-F1 of Sentiment | 45.18 | # 4 |
| Text Classification | DBpedia | Bidirectional Encoder Representations from Transformers | Error | 0.64 | # 2 |
| Natural Language Understanding | GLUE | BERT-LARGE | Average | 82.1 | # 2 |
| Type prediction | ManyTypes4TypeScript | BERT | Average Accuracy | 57.52 | # 9 |
| | | | Average Precision | 54.18 | # 7 |
| | | | Average Recall | 54.02 | # 7 |
| | | | Average F1 | 54.10 | # 7 |
| Semantic Textual Similarity | MRPC | BERT-LARGE | F1 | 89.3 | # 12 |
| Question Answering | MRQA | BERT (large) | Average F1 | 78.5 | # 2 |
| Natural Language Inference | MultiNLI | BERT-LARGE | Matched | 86.7 | # 22 |
| | | | Mismatched | 85.9 | # 17 |
| Question Answering | MultiRC | BERT-large (single model) | F1 | 70.0 | # 16 |
| | | | EM | 24.1 | # 11 |
| Named Entity Recognition | NCBI-disease | BERT Base | F1 | 86.37 | # 1 |
| Multimodal Intent Recognition | PhotoChat | BERT | F1 | 53.2 | # 4 |
| | | | Precision | 56.1 | # 3 |
| | | | Recall | 50.6 | # 6 |
| Natural Language Inference | QNLI | BERT-LARGE | Accuracy | 92.7% | # 25 |
| Paraphrase Identification | Quora Question Pairs | BERT-LARGE | F1 | 72.1 | # 12 |
| Common Sense Reasoning | ReCoRD | BERT-Base (single model) | F1 | 56.065 | # 32 |
| | | | EM | 54.040 | # 33 |
| Natural Language Inference | RTE | BERT-large 340M | Accuracy | 70.1% | # 53 |
| Named Entity Recognition | SciERC | BERT Base | F1 | 65.24 | # 1 |
| Linear-Probe Classification | SentEval | BERT | Accuracy | 84.9 | # 7 |
| Question Answering | SQuAD1.1 | BERT-LARGE (Single+TriviaQA) | F1 | 91.8 | # 30 |
| Question Answering | SQuAD1.1 | BERT-LARGE (Ensemble+TriviaQA) | EM | 87.4 | # 18 |
| | | | F1 | 93.2 | # 16 |
| Question Answering | SQuAD1.1 | BERT (ensemble) | EM | 87.433 | # 17 |
| | | | F1 | 93.160 | # 17 |
| Question Answering | SQuAD1.1 | BERT (single model) | EM | 85.083 | # 29 |
| | | | F1 | 91.835 | # 28 |
| Question Answering | SQuAD1.1 dev | BERT-LARGE (Single+TriviaQA) | EM | 84.2 | # 9 |
| | | | F1 | 91.1 | # 9 |
| Question Answering | SQuAD1.1 dev | BERT-LARGE (Ensemble+TriviaQA) | EM | 86.2 | # 7 |
| | | | F1 | 92.2 | # 7 |
| Sentiment Analysis | SST-2 Binary classification | BERT-LARGE | Accuracy | 94.9 | # 27 |
| Semantic Textual Similarity | STS Benchmark | BERT-LARGE | Spearman Correlation | 0.865 | # 20 |
| Common Sense Reasoning | SWAG | BERT-LARGE | Dev | 86.6 | # 1 |
| | | | Test | 86.3 | # 3 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-German | BERT | Accuracy | 70.5% | # 2 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-Spanish | BERT | Accuracy | 74.3% | # 2 |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
| --- | --- | --- | --- | --- | --- |
| Natural Language Understanding | PDP60 | BERT-large 340M | Accuracy | 78.3 | # 2 |
| Coreference Resolution | Winograd Schema Challenge | BERT-large 340M | Accuracy | 62.0 | # 53 |
| Natural Language Inference | WNLI | BERT-large 340M | Accuracy | 65.1 | # 20 |
| Question Answering | PIQA | BERT-Large 340M | Accuracy | 66.7 | # 53 |
