TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Generation	CIFAR-10	Performer (12 layers)	bits/dimension	3.310	# 48
Image Generation	CIFAR-10	Performer (6 layers)	bits/dimension	3.335	# 51
Offline RL	D4RL	Performer	Average Reward	63.9	# 7
D4RL	D4RL	Performer	Average Reward	63.8	# 9
Image Generation	ImageNet 64x64	Performer (12 layers)	Bits per dim	3.636	# 15
Image Generation	ImageNet 64x64	Performer (6 layers)	Bits per dim	3.719	# 21
Language Modelling	WikiText-103	Performer 125M	Test perplexity	26.8	# 64

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rethinking-attention-with-performers/offline-rl-on-d4rl)](https://paperswithcode.com/sota/offline-rl-on-d4rl?p=rethinking-attention-with-performers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rethinking-attention-with-performers/d4rl-on-d4rl)](https://paperswithcode.com/sota/d4rl-on-d4rl?p=rethinking-attention-with-performers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rethinking-attention-with-performers/image-generation-on-imagenet-64x64)](https://paperswithcode.com/sota/image-generation-on-imagenet-64x64?p=rethinking-attention-with-performers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rethinking-attention-with-performers/image-generation-on-cifar-10)](https://paperswithcode.com/sota/image-generation-on-cifar-10?p=rethinking-attention-with-performers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rethinking-attention-with-performers/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=rethinking-attention-with-performers)`

Rethinking Attention with Performers

ICLR 2021 · Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
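To make the linear-complexity claim in the abstract concrete, here is a minimal NumPy sketch of attention computed through positive orthogonal random features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`orthogonal_gaussian`, `positive_features`, `favor_attention`) are hypothetical, the block-wise QR sampling is one plausible way to draw orthogonal features, and the numerical stabilisation and periodic feature redrawing that a practical implementation would likely need are omitted. The feature map used is phi(x) = exp(w·x - |x|²/2) / sqrt(m), whose inner products are unbiased estimates of exp(q·k) when each w is Gaussian.

```python
import numpy as np


def orthogonal_gaussian(m, d, rng):
    """Return an (m, d) matrix whose rows are orthogonal within each block of
    d rows and have norms distributed like those of standard Gaussian vectors.
    (Assumed sampling scheme, for illustration only.)"""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) blocks
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        q = q * np.sign(np.diag(r))  # sign fix so q is uniformly distributed
        blocks.append(q.T)           # rows of q.T are orthonormal
    w = np.concatenate(blocks, axis=0)[:m]
    # Rescale each row to the norm of an independent Gaussian vector.
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
    return w * norms


def positive_features(x, w):
    """Map rows of x to strictly positive features so that
    E[phi(q) . phi(k)] = exp(q . k) for Gaussian rows of w."""
    m = w.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - sq_norm) / np.sqrt(m)


def softmax_attention(q, k, v):
    """Exact softmax attention; builds the full L x L matrix (quadratic in L)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def favor_attention(q, k, v, w):
    """Approximate softmax attention without forming the L x L matrix.
    Cost is O(L * m * d) for m random features (linear in L)."""
    scale = q.shape[-1] ** -0.25          # splits the 1/sqrt(d) between q and k
    q_prime = positive_features(q * scale, w)   # (L, m)
    k_prime = positive_features(k * scale, w)   # (L, m)
    kv = k_prime.T @ v                          # (m, d_v), computed once
    normaliser = q_prime @ k_prime.sum(axis=0)  # (L,), row sums of the kernel
    return (q_prime @ kv) / normaliser[:, None]
```

The key design point is the order of multiplication: computing `k_prime.T @ v` first keeps every intermediate at size `m x d` or `L x m`, which is where the linear scaling in sequence length comes from. Because the features are strictly positive, the estimated attention weights are non-negative and each output row remains a convex combination of the rows of `v`.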

Code

google-research/google-research official

32,732

tensorflow/models

76,577

facebookresearch/xformers

7,508

rosettacommons/rosettafold

1,926

idiap/fast-transformers

1,569

See all 12 implementations

Tasks

D4RL

Image Generation

Language Modelling

Offline RL

Datasets

CIFAR-10

WikiText-2

WikiText-103

D4RL

PG-19

Results from the Paper

Ranked #7 on Offline RL on D4RL

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Offline RL	D4RL	Performer	Average Reward	63.9	# 7
D4RL	D4RL	Performer	Average Reward	63.8	# 9

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Image Generation	CIFAR-10	Performer (12 layers)	bits/dimension	3.310	# 48
Image Generation	CIFAR-10	Performer (6 layers)	bits/dimension	3.335	# 51
Image Generation	ImageNet 64x64	Performer (12 layers)	Bits per dim	3.636	# 15
Image Generation	ImageNet 64x64	Performer (6 layers)	Bits per dim	3.719	# 21
Language Modelling	WikiText-103	Performer 125M	Test perplexity	26.8	# 64

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • FAVOR+ • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Performer • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
