Vision Transformers

Twins-PCPVT is a type of vision transformer that combines global attention, specifically the global sub-sampled attention proposed in the Pyramid Vision Transformer (PVT), with the conditional position encodings (CPE) proposed in CPVT, which replace the absolute position encodings used in PVT.
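
The global sub-sampled attention amounts to PVT's spatial-reduction attention: keys and values are computed from a feature map that is down-sampled by a reduction ratio, so every query still attends to a (sub-sampled) summary of the whole image. Below is a minimal PyTorch sketch of this idea; the class name, head count, and reduction ratio are illustrative and not taken from the official Twins implementation.

```python
import torch
import torch.nn as nn

class GlobalSubsampledAttention(nn.Module):
    """PVT-style spatial-reduction attention: keys/values come from a
    feature map sub-sampled by `sr_ratio`, so each query attends to a
    coarse global summary of the stage. Names/defaults are illustrative."""

    def __init__(self, dim, num_heads=8, sr_ratio=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided convolution shrinks the key/value map by sr_ratio.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # x: (B, H*W, C) tokens of one stage
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        if self.sr_ratio > 1:
            kv_in = x.transpose(1, 2).reshape(B, C, H, W)
            kv_in = self.sr(kv_in).reshape(B, C, -1).transpose(1, 2)
            kv_in = self.norm(kv_in)
        else:
            kv_in = x
        kv = self.kv(kv_in).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```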

The position encoding generator (PEG), which produces the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used, i.e., a 2D depth-wise convolution without batch normalization. For image-level classification, following CPVT, the class token is removed and global average pooling is applied at the end of the last stage. For other vision tasks, the design of PVT is followed.
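
The sketch below illustrates such a PEG and the global-average-pooled classification head. The 3x3 depth-wise convolution follows the "simplest form" described above; everything else (module names, shapes, dimensions) is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position encoding generator: a 3x3 depth-wise convolution with no
    batch normalization, added back to the tokens as a residual.
    Hyper-parameters here are illustrative."""

    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depth-wise.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence of one stage.
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pos = self.proj(feat).flatten(2).transpose(1, 2)
        return x + pos  # conditional position encoding as a residual


# Illustrative usage: the PEG sits after the first encoder block of a stage;
# classification uses global average pooling instead of a class token.
if __name__ == "__main__":
    x = torch.randn(2, 14 * 14, 64)               # (B, N, C) tokens of one stage
    x = PEG(dim=64)(x, H=14, W=14)                # inject conditional positions
    logits = nn.Linear(64, 1000)(x.mean(dim=1))   # GAP head (no class token)
```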

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Tasks


Task                    Papers   Share
Image Classification    1        50.00%
Semantic Segmentation   1        50.00%
