Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation

Recently, a new line of work has emerged that seeks to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply methods developed for symmetric kernels to the asymmetric kernel in self-attention, leaving a nontrivial gap between the analytical understanding and the numerical implementation. In this paper, we provide a new perspective for representing and optimizing self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention commonly observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast as maximizing the projection variances in the attention outputs; $ii$) a novel attention mechanism, Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) using the KKT conditions, we prove that the stationary solution of the KSVD optimization in Primal-Attention yields a zero-valued objective. Hence, the KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without an extra decomposition. Numerical experiments show state-of-the-art performance of Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of canonical self-attention, further verifying the potential of our method. To the best of our knowledge, this is the first work to provide a primal-dual representation for the asymmetric kernel in self-attention and to successfully apply it to modeling and optimization.
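To make the mechanism concrete, below is a minimal sketch (not the authors' released code) of an attention layer in the spirit of the primal representation: outputs are computed as projections of two feature maps, so the N x N kernel (attention) matrix is never formed, and a scalar regularization term stands in for the KSVD optimization. The feature map, the projection weights named `W_e` and `W_r`, and the exact form of the regularizer are illustrative assumptions, not the paper's precise formulation.

```python
# Minimal sketch of a primal-style attention layer (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrimalStyleAttention(nn.Module):
    def __init__(self, dim: int, num_proj: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # maps inputs to the "query" branch
        self.k_proj = nn.Linear(dim, dim)  # maps inputs to the "key" branch
        # Projection weights playing the role of W_e and W_r in the primal representation.
        self.W_e = nn.Parameter(torch.randn(dim, num_proj) / dim ** 0.5)
        self.W_r = nn.Parameter(torch.randn(dim, num_proj) / dim ** 0.5)

    @staticmethod
    def feature_map(x: torch.Tensor) -> torch.Tensor:
        # A simple positive feature map (an assumption; the paper's choice may differ).
        return F.elu(x) + 1.0

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, dim)
        phi_q = self.feature_map(self.q_proj(x))  # (B, N, d)
        phi_k = self.feature_map(self.k_proj(x))  # (B, N, d)
        e = phi_q @ self.W_e                      # (B, N, s) "left" projection scores
        r = phi_k @ self.W_r                      # (B, N, s) "right" projection scores
        out = torch.cat([e, r], dim=-1)           # attention output from the two branches

        # Placeholder KSVD-style regularizer: a coupling term between the two
        # projection branches minus their variances, driven toward zero during
        # training. The paper's exact objective may differ.
        n = x.shape[1]
        coupling = torch.einsum('bns,bns->', e, r) / n
        variance = 0.5 * (e.pow(2).sum() + r.pow(2).sum()) / n
        ksvd_reg = (coupling - variance).abs()
        return out, ksvd_reg
```

In a training loop, `ksvd_reg` would be scaled by a small coefficient and added to the task loss, mirroring the abstract's point that the KSVD optimization can be carried out as a regularization term rather than an explicit decomposition, while the forward pass stays linear in sequence length.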

NeurIPS 2023

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| D4RL | D4RL | Primal.+DT | Average Reward | 77.5 | #3 |
| Offline RL | D4RL | Primal.+DT | Average Reward | 77.5 | #2 |
| Long-range modeling | LRA | Primal.+Trans. | ListOps | 37.3 | #19 |
| Long-range modeling | LRA | Primal.+Trans. | Text | 65.4 | #20 |
| Long-range modeling | LRA | Primal.+Trans. | Image | 43.9 | #20 |
| Long-range modeling | LRA | Primal.+Trans. | Pathfinder | 74.3 | #21 |
| Long-range modeling | LRA | Primal.+Trans. | Avg | 60.4 | #19 |
| Time Series Classification | UEA | Primal.+Trans. | ACC | 73.1 | #2 |
| Language Modelling | WikiText-103 | Primal.+Trans. | Test perplexity | 31.0 | #71 |

Methods


No methods listed for this paper.