Reformer: The Efficient Transformer

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
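As a rough illustration of the first technique, below is a minimal NumPy sketch of locality-sensitive-hashing attention: queries and keys share one projection, tokens are hashed into buckets via random rotations, and attention is computed only within each bucket. This is a toy, single-round, single-head version that omits the sorting/chunking, multi-round hashing, and causal masking used in the paper; the names `lsh_hash` and `lsh_attention` are illustrative, not from the paper's code.

```python
import numpy as np

def lsh_hash(x, n_buckets, rng):
    # Angular LSH: project onto random directions and take the argmax over
    # the concatenation [xR ; -xR], which assigns each row to one of n_buckets.
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ r
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=8, seed=0):
    # Toy LSH attention: `qk` is the shared query/key matrix of shape (L, d),
    # `v` is (L, d_v). Attention is restricted to tokens in the same bucket,
    # which is what reduces the quadratic cost of full attention.
    rng = np.random.default_rng(seed)
    L, d = qk.shape
    unit = qk / np.linalg.norm(qk, axis=-1, keepdims=True)
    buckets = lsh_hash(unit, n_buckets, rng)
    out = np.zeros_like(v, dtype=float)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)
        np.fill_diagonal(scores, -1e9)  # a position does not attend to itself
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ v[idx]
    return out
```

The second technique, reversible residual layers, can be sketched in a few lines. Assuming `f` and `g` stand for the attention and feed-forward sublayers and are deterministic during the pass, each block's inputs can be recomputed from its outputs in the backward pass, so per-layer activations never need to be stored.

```python
def reversible_forward(x1, x2, f, g):
    # RevNet-style coupling: y1 = x1 + f(x2), y2 = x2 + g(y1).
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def reversible_inverse(y1, y2, f, g):
    # Exact reconstruction of the inputs from the outputs, which is what
    # allows activations to be stored only once instead of N times.
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```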

ICLR 2020

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Offline RL | D4RL | Reformer | Average Reward | 64.4 | #6 |
| D4RL | D4RL | Reformer | Average Reward | 63.9 | #8 |
| Question Answering | Natural Questions (long) | Locality-Sensitive Hashing | F1 | 75.5 | #3 |

Results from Other Papers


| Task | Dataset | Model | Metric Name | Metric Value | Rank |
|------|---------|-------|-------------|--------------|------|
| Image Generation | ImageNet 64x64 | Reformer (12 layers) | Bits per dim | 3.710 | #19 |
| Image Generation | ImageNet 64x64 | Reformer (6 layers) | Bits per dim | 3.740 | #22 |
| Question Answering | Quasar-T | Locality-Sensitive Hashing | EM | 53.2 | #2 |
| Open-Domain Question Answering | SearchQA | Locality-Sensitive Hashing | EM | 66.0 | #2 |
| Language Modelling | WikiText-103 | Reformer 125M | Test perplexity | 26.0 | #62 |

Methods