Vision Transformers

RegionViT consists of two tokenization processes that convert an image into regional tokens (upper path) and local tokens (lower path). Each tokenization is a convolution with a different patch size: regional tokens use a patch size of $28^2$ while local tokens use $4^2$, with both projected to dimension $C$. One regional token therefore covers $7^2$ local tokens based on spatial locality, which sets the window size of a local region to $7^2$. At stage 1, the two sets of tokens are passed through the proposed regional-to-local transformer encoders. For the later stages, to balance the computational load and to obtain feature maps at different resolutions, a downsampling process halves the spatial resolution while doubling the channel dimension, as in CNNs, on both regional and local tokens before the next stage. Finally, at the end of the network, the remaining regional tokens are simply averaged to form the final embedding for classification, while detection uses all local tokens at each stage since they provide more fine-grained location information. With this pyramid structure, the ViT generates multi-scale features and can hence be easily extended to other vision applications, e.g., object detection, beyond image classification alone.

Source: RegionViT: Regional-to-Local Attention for Vision Transformers
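
The sketch below illustrates the dual tokenization and stage layout described above: two convolutions with patch sizes $28^2$ and $4^2$, followed by downsampling that halves resolution and doubles channels, and regional-token averaging for classification. It is a minimal sketch assuming PyTorch; the class and attribute names (RegionViTSketch, down_regional, etc.) are hypothetical, and the regional-to-local attention itself is only stubbed out, not the official implementation.

```python
import torch
import torch.nn as nn

class RegionViTSketch(nn.Module):
    """Hypothetical sketch of RegionViT's token paths, not the paper's code."""

    def __init__(self, C=96, num_classes=1000):
        super().__init__()
        # Regional tokens: one 28x28 patch per token (stride 28).
        self.regional_tokenizer = nn.Conv2d(3, C, kernel_size=28, stride=28)
        # Local tokens: one 4x4 patch per token, so each regional token
        # covers (28/4)^2 = 7^2 local tokens -- the local window size.
        self.local_tokenizer = nn.Conv2d(3, C, kernel_size=4, stride=4)
        # Downsampling between stages: halve spatial resolution, double
        # channels, as in CNN pyramids (sketched here with strided convs).
        self.down_regional = nn.Conv2d(C, 2 * C, kernel_size=3, stride=2, padding=1)
        self.down_local = nn.Conv2d(C, 2 * C, kernel_size=3, stride=2, padding=1)
        self.head = nn.Linear(2 * C, num_classes)

    def forward(self, x):
        regional = self.regional_tokenizer(x)  # (B, C, H/28, W/28)
        local = self.local_tokenizer(x)        # (B, C, H/4,  W/4)
        # ... stage-1 regional-to-local transformer encoder would go here ...
        regional = self.down_regional(regional)
        local = self.down_local(local)
        # ... later stages repeat attention + downsampling ...
        # Classification head: average the remaining regional tokens.
        emb = regional.flatten(2).mean(dim=2)  # (B, 2C)
        return self.head(emb)

model = RegionViTSketch()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

For detection, the local tokens at each stage would be collected instead of the pooled regional embedding, since they retain the fine-grained spatial grid.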


Tasks


Task                  Papers  Share
Action Recognition    1       25.00%
Image Classification  1       25.00%
Keypoint Detection    1       25.00%
Object Detection      1       25.00%
