Vision Transformers

Twins-SVT is a vision transformer that employs a spatially separable attention mechanism (SSAM) composed of two types of attention operations: (i) locally-grouped self-attention (LSA), which captures fine-grained, short-distance information, and (ii) global sub-sampled attention (GSA), which handles long-distance and global information. On top of this, it uses conditional position encodings together with the pyramid architectural design of the Pyramid Vision Transformer.

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
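
The interplay of the two attention operations can be illustrated with a short PyTorch sketch. This is an assumption-laden illustration, not the authors' implementation: the window size, sub-sampling ratio, the use of nn.MultiheadAttention, and the strided convolution used for sub-sampling are all illustrative choices, and the conditional position encodings and pyramid stage structure are omitted.

```python
# Minimal sketch of spatially separable attention (LSA + GSA), assuming square
# inputs whose side is divisible by the window size and the sub-sampling ratio.
# Hyper-parameters and module choices are illustrative, not the paper's exact ones.
import torch
import torch.nn as nn


class LocallyGroupedSelfAttention(nn.Module):
    """LSA: multi-head self-attention restricted to non-overlapping local windows."""

    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) -> group tokens into (B * num_windows, window*window, C)
        b, n, c = x.shape
        ws = self.window
        x = x.view(b, h // ws, ws, w // ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)  # attention only inside each window
        x = x.view(b, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, n, c)
        return x


class GlobalSubsampledAttention(nn.Module):
    """GSA: queries from every position, keys/values from a sub-sampled feature map."""

    def __init__(self, dim, num_heads=4, sr_ratio=4):
        super().__init__()
        # A strided convolution serves as the sub-sampling function for K and V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, (H/r)*(W/r), C)
        x, _ = self.attn(x, kv, kv)                  # global but cheaper attention
        return x


if __name__ == "__main__":
    b, h, w, c = 2, 28, 28, 64
    tokens = torch.randn(b, h * w, c)
    lsa = LocallyGroupedSelfAttention(c, window=7)
    gsa = GlobalSubsampledAttention(c, sr_ratio=4)
    # One SSAM step: local attention for fine detail, then sub-sampled global attention.
    out = gsa(lsa(tokens, h, w), h, w)
    print(out.shape)  # torch.Size([2, 784, 64])
```

Alternating the two operations is the point of the separable design: LSA keeps attention cost confined to small windows, while GSA exchanges information across windows through a much shorter sub-sampled key/value sequence.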

Tasks


Task                    Papers   Share
Image Classification    1        50.00%
Semantic Segmentation   1        50.00%
