Vision Transformers

CrossViT is a vision transformer that uses a dual-branch architecture to extract multi-scale feature representations for image classification. It combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger visual features. Small-patch and large-patch tokens are processed by two separate branches of different computational complexity, and the two sets of tokens are fused multiple times so that they complement each other.
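As a rough illustration of why the two branches differ in cost, the sketch below counts the tokens produced by two patch sizes on the same image. The function name `num_patch_tokens` and the sizes 224, 8, and 16 are illustrative assumptions, not values taken from the text above.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Illustrative sizes only (assumed for this sketch).
small_branch_tokens = num_patch_tokens(224, 8)   # fine-grained patches
large_branch_tokens = num_patch_tokens(224, 16)  # coarse-grained patches
print(small_branch_tokens, large_branch_tokens)  # prints "784 196"
```

Halving the patch side quadruples the token count, so under these assumptions the small-patch branch carries four times as many tokens as the large-patch branch.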

Fusion is performed by an efficient cross-attention module, in which each transformer branch creates a non-patch token that acts as an agent, exchanging information with the other branch through attention. Because only this single token serves as the query, the attention map in fusion is generated in time linear in the number of tokens, rather than quadratic as in full attention.
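The linear cost can be seen in a minimal single-query attention sketch. This is a simplification under stated assumptions, not the CrossViT implementation: it uses one head, no learned projections, and assumes both branches share the same embedding dimension. The name `agent_cross_attention` is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def agent_cross_attention(agent_token, other_patch_tokens):
    """One agent (non-patch) token queries the other branch's patch tokens.

    Assumption: no learned query/key/value projections and a single head.
    One score is computed per patch token, so the cost is linear in the
    number of tokens.
    """
    dim = len(agent_token)
    scale = 1.0 / math.sqrt(dim)
    scores = [dot(agent_token, t) * scale for t in other_patch_tokens]
    weights = softmax(scores)
    attended = [
        sum(w * t[i] for w, t in zip(weights, other_patch_tokens))
        for i in range(dim)
    ]
    # Residual connection: the agent token is updated, not replaced.
    updated = [a + b for a, b in zip(agent_token, attended)]
    return updated, weights


agent = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated_agent, attn = agent_cross_attention(agent, patches)
```

With n patch tokens the attention map here has n entries, whereas letting every token attend to every other token would require n × n entries.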

Source: CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Tasks


Task                      Papers   Share
Sensor Fusion             1        25.00%
Adversarial Robustness    1        25.00%
General Classification    1        25.00%
Image Classification      1        25.00%

Categories