Vision Xformers: Efficient Attention for Image Classification

5 Jul 2021 · Pranav Jeevan, Amit Sethi

Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computation to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Additionally, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. First, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X ∈ {Performer, Linformer, Nyströmformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a seven-fold reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Second, we introduced an inductive bias for images by replacing the initial linear embedding layer in ViX with convolutional layers, which significantly increased classification accuracy without increasing the model size. Third, we replaced the learnable 1D position embeddings in ViT with rotary position embeddings (RoPE), which increased classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
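The second modification is straightforward to picture in code. Below is a minimal PyTorch sketch, our own illustration rather than the paper's released code: the class names, channel widths, and two-layer stem are assumptions. It contrasts the standard ViT patch embedding (a linear projection of non-overlapping patches) with a small convolutional stem that injects a locality bias before the tokens reach the transformer:

```python
# Minimal sketch: linear patch embedding vs. a convolutional stem.
# All names and sizes here are illustrative, not the paper's exact config.
import torch
import torch.nn as nn


class LinearPatchEmbed(nn.Module):
    """Standard ViT embedding: split the image into non-overlapping
    patches and project each one linearly. A strided convolution with
    kernel_size == stride == patch_size is mathematically identical to
    flattening patches and applying nn.Linear."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=128):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class ConvPatchEmbed(nn.Module):
    """Convolutional stem: overlapping 3x3 convolutions add a locality
    bias before tokenization, with only a small parameter increase."""

    def __init__(self, in_chans=3, dim=128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        x = self.stem(x)                     # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, tokens, dim)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 32, 32)          # CIFAR-10-sized input
    print(LinearPatchEmbed()(imgs).shape)     # torch.Size([2, 64, 128])
    print(ConvPatchEmbed()(imgs).shape)       # torch.Size([2, 64, 128])
```

Both embeddings produce the same token grid for a CIFAR-10 image, so the downstream transformer is unchanged; presumably the "Hybrid" prefix in the results table below marks the convolutionally embedded variants.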

Results from the Paper


Task: Image Classification · Dataset: CIFAR-10

Model                               Percentage correct   Global Rank   Params (M)   Params Rank
CvP                                 83.19                #208          -            -
CvN                                 83.26                #207          -            -
CCN                                 83.36                #206          0.906075     #180
Vision Nystromformer (ViN)          65.06                #224          0.530970     #172
Hybrid PiN                          74                   #222          0.990298     #182
Hybrid Vision Nystromformer (ViN)   75.26                #221          0.623706     #174
Hybrid ViT+RoPE                     76.90                #219          -            -
LeViP                               79.50                #217          -            -
