Vision Xformers: Efficient Attention for Image Classification

5 Jul 2021 · Pranav Jeevan, Amit Sethi

Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computation to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Additionally, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. First, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X ∈ {Performer, Linformer, Nyströmformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a seven-fold reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Second, we introduced an inductive bias for images by replacing the initial linear embedding layer in ViX with convolutional layers, which significantly increased classification accuracy without increasing the model size. Third, we replaced the learnable 1D position embeddings in ViT with rotary position embeddings (RoPE), which increased classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
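The second modification is straightforward to picture in code. Below is a minimal PyTorch sketch, our own illustration rather than the paper's released code: the class names, channel widths, and two-layer stem are assumptions. It contrasts the standard ViT patch embedding (a linear projection of non-overlapping patches) with a small convolutional stem that injects a locality bias before the tokens reach the transformer:

```python
# Minimal sketch: linear patch embedding vs. a convolutional stem.
# All names and sizes here are illustrative, not the paper's exact config.
import torch
import torch.nn as nn


class LinearPatchEmbed(nn.Module):
    """Standard ViT embedding: split the image into non-overlapping
    patches and project each one linearly. A strided convolution with
    kernel_size == stride == patch_size is mathematically identical to
    flattening patches and applying nn.Linear."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=128):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class ConvPatchEmbed(nn.Module):
    """Convolutional stem: overlapping 3x3 convolutions add a locality
    bias before tokenization, with only a small parameter increase."""

    def __init__(self, in_chans=3, dim=128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        x = self.stem(x)                     # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, tokens, dim)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 32, 32)          # CIFAR-10-sized input
    print(LinearPatchEmbed()(imgs).shape)     # torch.Size([2, 64, 128])
    print(ConvPatchEmbed()(imgs).shape)       # torch.Size([2, 64, 128])
```

Both embeddings produce the same token grid for a CIFAR-10 image, so the downstream transformer is unchanged; presumably the "Hybrid" prefix in the results table below marks the convolutionally embedded variants.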

Results from the Paper


Task: Image Classification · Dataset: CIFAR-10

Model                               Percentage correct   Global Rank   Params (M)   Params Rank
CvP                                 83.19                #208          -            -
CvN                                 83.26                #207          -            -
CCN                                 83.36                #206          0.906075     #180
Vision Nystromformer (ViN)          65.06                #224          0.530970     #172
Hybrid PiN                          74                   #222          0.990298     #182
Hybrid Vision Nystromformer (ViN)   75.26                #221          0.623706     #174
Hybrid ViT+RoPE                     76.90                #219          -            -
LeViP                               79.50                #217          -            -
