XLNet: Generalized Autoregressive Pretraining for Language Understanding

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
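
To illustrate the permutation idea in the abstract, here is a minimal sketch (not the authors' implementation). It samples one factorization order and works out which positions each token may condition on under that order. The function name `permutation_mask` and the boolean-matrix representation are assumptions made for this example only.

```python
import random


def permutation_mask(seq_len, rng):
    """Sample a factorization order and the context mask it induces.

    Returns (order, mask), where order[t] is the position predicted at
    step t and mask[i][j] is True when position i may condition on
    position j, i.e. when j is predicted earlier than i in the order.
    """
    order = list(range(seq_len))
    rng.shuffle(order)  # one random permutation of the positions
    # rank[pos] = step at which position `pos` is predicted
    rank = {pos: step for step, pos in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(seq_len)]
            for i in range(seq_len)]
    return order, mask


if __name__ == "__main__":
    order, mask = permutation_mask(4, random.Random(0))
    print("factorization order:", order)
    for i, row in enumerate(mask):
        visible = [j for j, ok in enumerate(row) if ok]
        print(f"position {i} conditions on positions {visible}")
```

Under any single sampled order, the first position in the order conditions on nothing and the last conditions on all the others. Because the order is random, a position's visible context can lie on either side of it, which is how an autoregressive objective averaged over many orders can still learn from bidirectional context.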

NeurIPS 2019

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Humor Detection | 200k Short Texts for Humor Detection | XLNet Large Cased | F1-score | 0.920 | #2 |
| Text Classification | AG News | XLNet | Error | 4.45 | #1 |
| Text Classification | Amazon-2 | XLNet | Error | 2.11 | #1 |
| Text Classification | Amazon-5 | XLNet | Error | 31.67 | #1 |
| Natural Language Inference | ANLI test | XLNet (Large) | A1 | 70.3 | #7 |
| | | | A2 | 50.9 | #12 |
| | | | A3 | 49.4 | #13 |
| Document Ranking | ClueWeb09-B | XLNet | nDCG@20 | 31.10 | #1 |
| | | | ERR@20 | 20.28 | #1 |
| Linguistic Acceptability | CoLA | XLNet (single model) | Accuracy | 69% | #11 |
| Text Classification | DBpedia | XLNet | Error | 0.62 | #1 |
| Sentiment Analysis | IMDb | XLNet | Accuracy | 96.21 | #3 |
| Semantic Textual Similarity | MRPC | XLNet (single model) | Accuracy | 90.8% | #9 |
| Natural Language Inference | MultiNLI | XLNet (single model) | Matched | 90.8 | #8 |
| Natural Language Inference | QNLI | XLNet (single model) | Accuracy | 94.9% | #11 |
| Paraphrase Identification | Quora Question Pairs | XLNet-Large (ensemble) | Accuracy | 90.3 | #6 |
| | | | F1 | 74.2 | #7 |
| Question Answering | Quora Question Pairs | XLNet (single model) | Accuracy | 92.3% | #1 |
| Question Answering | RACE | XLNet | RACE-m | 85.45 | #1 |
| | | | RACE | 81.75 | #1 |
| Reading Comprehension | RACE | XLNet | Accuracy (High) | 84.0 | #5 |
| | | | Accuracy (Middle) | 88.6 | #5 |
| Audio Question Answering | RoadTracer | XLNet | RACE-h | 80.21 | #1 |
| Chinese Reading Comprehension | RoadTracer | XLNet | Accuracy | 85.4 | #1 |
| Natural Language Inference | RTE | XLNet (single model) | Accuracy | 85.9% | #24 |
| Semantic Textual Similarity | SentEval | XLNet-Large | MRPC | 93.0/90.7 | #1 |
| | | | SICK-R | - | #3 |
| | | | SICK-E | - | #3 |
| | | | STS | 91.6/91.1* | #1 |
| Question Answering | SQuAD1.1 | XLNet (single model) | EM | 89.898 | #5 |
| | | | F1 | 95.080 | #5 |
| | | | Hardware Burden | 46449G | #1 |
| Question Answering | SQuAD1.1 dev | XLNet (single model) | EM | 89.7 | #4 |
| | | | F1 | 95.1 | #3 |
| Question Answering | SQuAD2.0 | XLNet (single model) | EM | 87.926 | #77 |
| | | | F1 | 90.689 | #79 |
| Question Answering | SQuAD2.0 dev | XLNet (single model) | F1 | 90.6 | #1 |
| | | | EM | 87.9 | #1 |
| Sentiment Analysis | SST-2 Binary classification | XLNet-Large (ensemble) | Accuracy | 96.8 | #10 |
| Sentiment Analysis | SST-2 Binary classification | XLNet (single model) | Accuracy | 97 | #7 |
| Semantic Textual Similarity | STS Benchmark | XLNet (single model) | Pearson Correlation | 0.925 | #4 |
| Natural Language Inference | WNLI | XLNet | Accuracy | 92.5 | #4 |
| Text Classification | Yelp-2 | XLNet | Accuracy | 98.63% | #1 |
| Text Classification | Yelp-5 | XLNet | Accuracy | 72.95% | #2 |
| Sentiment Analysis | Yelp Binary classification | XLNet | Error | 1.37 | #1 |
| Sentiment Analysis | Yelp Fine-grained classification | XLNet | Error | 27.05 | #1 |

Methods