Vision Transformer

Introduced by Dosovitskiy et al. in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of them are then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. In order to perform classification, the standard approach of adding an extra learnable “classification token” to the sequence is used.

Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Semantic Segmentation	82	9.46%
Image Classification	66	7.61%
Object Detection	36	4.15%
Self-Supervised Learning	33	3.81%
Image Segmentation	24	2.77%
Instance Segmentation	18	2.08%
Classification	16	1.85%
Language Modelling	14	1.61%
Autonomous Driving	14	1.61%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
Dense Connections	Feedforward Networks
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms

Categories

Add Remove

Image Models

Vision Transformers