Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

ACL 2019 · Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Transformers have the potential to learn longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel relative positional encoding scheme, which together capture longer-term dependency and resolve the context fragmentation problem.
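To make the segment-level recurrence concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code): each segment attends over its own hidden states concatenated with a cached, gradient-detached memory of the previous segment's hidden states. The relative positional encoding is omitted for brevity, with standard multi-head attention used in its place; the names RecurrentSegmentLayer and run_segments are illustrative assumptions.

```python
# Minimal sketch of Transformer-XL-style segment-level recurrence (simplified:
# no relative positional encoding, single layer, standard dot-product attention).
import torch
import torch.nn as nn

class RecurrentSegmentLayer(nn.Module):
    """One self-attention layer whose keys/values cover [cached memory; current segment]."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Keys and values also see the cached states of the previous segment,
        # extending the effective context beyond the current segment length.
        context = torch.cat([memory, x], dim=1)           # (B, M+L, D)
        out, _ = self.attn(query=x, key=context, value=context)
        x = self.norm1(x + out)
        return self.norm2(x + self.ff(x))

def run_segments(layer, segments, mem_len: int, d_model: int):
    """Process a stream of segments, carrying a fixed-length memory across them."""
    batch = segments[0].size(0)
    # Zero-initialized cache stands in for an empty memory at the start of the stream.
    memory = torch.zeros(batch, mem_len, d_model)
    outputs = []
    for seg in segments:
        h = layer(seg, memory)
        # Cache the newest hidden states; detach so no gradient flows back into
        # earlier segments (the stop-gradient applied to the memory in the paper).
        memory = torch.cat([memory, h], dim=1)[:, -mem_len:].detach()
        outputs.append(h)
    return outputs

if __name__ == "__main__":
    torch.manual_seed(0)
    layer = RecurrentSegmentLayer(d_model=64, n_heads=4)
    segs = [torch.randn(2, 16, 64) for _ in range(3)]     # 3 segments of length 16
    outs = run_segments(layer, segs, mem_len=32, d_model=64)
    print([o.shape for o in outs])                        # three (2, 16, 64) tensors
```

In the full model, each layer caches its own memory and the attention scores additionally use relative position embeddings, so the same weights can attend over positions beyond the training segment length at evaluation time.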

Results (Task: Language Modelling)

DATASET | MODEL | METRIC | VALUE | GLOBAL RANK
enwik8 | Transformer-XL - 12 layers | Bit per Character (BPC) | 1.06 | # 11
enwik8 | Transformer-XL - 12 layers | Number of params | 41M | # 14
enwik8 | Transformer-XL - 18 layers | Bit per Character (BPC) | 1.03 | # 10
enwik8 | Transformer-XL - 18 layers | Number of params | 88M | # 8
enwik8 | Transformer-XL - 24 layers | Bit per Character (BPC) | 0.99 | # 7
enwik8 | Transformer-XL - 24 layers | Number of params | 277M | # 2
Hutter Prize | 18-layer Transformer-XL | Bit per Character (BPC) | 1.03 | # 2
Hutter Prize | 18-layer Transformer-XL | Number of params | 88M | # 3
Hutter Prize | 24-layer Transformer-XL | Bit per Character (BPC) | 0.99 | # 1
Hutter Prize | 24-layer Transformer-XL | Number of params | 277M | # 1
Hutter Prize | 12-layer Transformer-XL | Bit per Character (BPC) | 1.06 | # 3
Hutter Prize | 12-layer Transformer-XL | Number of params | 41M | # 7
One Billion Word | Transformer-XL Base | PPL | 23.5 | # 3
One Billion Word | Transformer-XL Base | Number of params | 0.46B | # 1
One Billion Word | Transformer-XL Large | PPL | 21.8 | # 1
One Billion Word | Transformer-XL Large | Number of params | 0.8B | # 1
Penn Treebank (Word Level) | Transformer-XL | Validation perplexity | 56.72 | # 15
Penn Treebank (Word Level) | Transformer-XL | Test perplexity | 54.55 | # 19
Penn Treebank (Word Level) | Transformer-XL | Number of params | 24M | # 6
Text8 | Transformer-XL - 24 layers | Bit per Character (BPC) | 1.08 | # 4
Text8 | Transformer-XL - 24 layers | Number of params | 277M | # 1
WikiText-103 | Transformer-XL Large | Validation perplexity | 18.2 | # 7
WikiText-103 | Transformer-XL Large | Test perplexity | 18.3 | # 11
WikiText-103 | Transformer-XL Large | Number of params | 257M | # 6
WikiText-103 | Transformer-XL Standard | Validation perplexity | 23.1 | # 12
WikiText-103 | Transformer-XL Standard | Test perplexity | 24.0 | # 18
WikiText-103 | Transformer-XL Standard | Number of params | 151M | # 9

Methods used in the Paper