Vision Transformers

Twins-SVT is a vision transformer that employs a spatially separable attention mechanism (SSAM) composed of two types of attention operations: (i) locally-grouped self-attention (LSA), which captures fine-grained, short-distance information, and (ii) global sub-sampled attention (GSA), which handles long-distance and global information. On top of this, it uses conditional position encodings together with the pyramid architectural design of the Pyramid Vision Transformer.

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
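
The interplay of the two attention operations can be illustrated with a short PyTorch sketch. This is an assumption-laden illustration, not the authors' implementation: the window size, sub-sampling ratio, the use of nn.MultiheadAttention, and the strided convolution used for sub-sampling are all illustrative choices, and the conditional position encodings and pyramid stage structure are omitted.

```python
# Minimal sketch of spatially separable attention (LSA + GSA), assuming square
# inputs whose side is divisible by the window size and the sub-sampling ratio.
# Hyper-parameters and module choices are illustrative, not the paper's exact ones.
import torch
import torch.nn as nn


class LocallyGroupedSelfAttention(nn.Module):
    """LSA: multi-head self-attention restricted to non-overlapping local windows."""

    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) -> group tokens into (B * num_windows, window*window, C)
        b, n, c = x.shape
        ws = self.window
        x = x.view(b, h // ws, ws, w // ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)  # attention only inside each window
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, n, c)
        return x


class GlobalSubsampledAttention(nn.Module):
    """GSA: queries from every position, keys/values from a sub-sampled feature map."""

    def __init__(self, dim, num_heads=4, sr_ratio=4):
        super().__init__()
        # A strided convolution serves as the sub-sampling function for K and V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, (H/r)*(W/r), C)
        x, _ = self.attn(x, kv, kv)                  # global but cheaper attention
        return x


if __name__ == "__main__":
    b, h, w, c = 2, 28, 28, 64
    tokens = torch.randn(b, h * w, c)
    lsa = LocallyGroupedSelfAttention(c, window=7)
    gsa = GlobalSubsampledAttention(c, sr_ratio=4)
    # One SSAM step: local attention for fine detail, then sub-sampled global attention.
    out = gsa(lsa(tokens, h, w), h, w)
    print(out.shape)  # torch.Size([2, 784, 64])
```

Alternating the two operations is the point of the separable design: LSA keeps attention cost confined to small windows, while GSA exchanges information across windows through a much shorter sub-sampled key/value sequence.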

Tasks


Task                    Papers   Share
Image Classification    1        50.00%
Semantic Segmentation   1        50.00%
