Vision Transformers

Twins-PCPVT is a type of vision transformer that combines global attention, specifically the global sub-sampled attention proposed in the Pyramid Vision Transformer (PVT), with the conditional position encodings (CPE) proposed in CPVT, which replace the absolute position encodings used in PVT.
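
The global sub-sampled attention amounts to PVT's spatial-reduction attention: keys and values are computed from a feature map that is down-sampled by a reduction ratio, so every query still attends to a (sub-sampled) summary of the whole image. Below is a minimal PyTorch sketch of this idea; the class name, head count, and reduction ratio are illustrative and not taken from the official Twins implementation.

```python
import torch
import torch.nn as nn

class GlobalSubsampledAttention(nn.Module):
    """PVT-style spatial-reduction attention: keys/values come from a
    feature map sub-sampled by `sr_ratio`, so each query attends to a
    coarse global summary of the stage. Names/defaults are illustrative."""

    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided convolution shrinks the key/value map by sr_ratio.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # x: (B, H*W, C) tokens of one stage
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        if self.sr_ratio > 1:
            kv_in = x.transpose(1, 2).reshape(B, C, H, W)
            kv_in = self.sr(kv_in).reshape(B, C, -1).transpose(1, 2)
            kv_in = self.norm(kv_in)
        else:
            kv_in = x
        kv = self.kv(kv_in).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```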

The position encoding generator (PEG), which produces the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used, i.e., a 2D depth-wise convolution without batch normalization. For image-level classification, following CPVT, the class token is removed and global average pooling is applied at the end of the last stage. For other vision tasks, the design of PVT is followed.
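
The sketch below illustrates such a PEG and the global-average-pooled classification head. The 3x3 depth-wise convolution follows the "simplest form" described above; everything else (module names, shapes, dimensions) is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position encoding generator: a 3x3 depth-wise convolution with no
    batch normalization, added back to the tokens as a residual.
    Hyper-parameters here are illustrative."""

    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depth-wise.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence of one stage.
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pos = self.proj(feat).flatten(2).transpose(1, 2)
        return x + pos  # conditional position encoding as a residual


# Illustrative usage: the PEG sits after the first encoder block of a stage;
# classification uses global average pooling instead of a class token.
if __name__ == "__main__":
    x = torch.randn(2, 14 * 14, 64)               # (B, N, C) tokens of one stage
    x = PEG(dim=64)(x, H=14, W=14)                # inject conditional positions
    logits = nn.Linear(64, 1000)(x.mean(dim=1))   # GAP head (no class token)
```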

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Tasks


Task                    Papers   Share
Image Classification    1        50.00%
Semantic Segmentation   1        50.00%
