CrossViT

Introduced by Chen et al. in CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT is a type of vision transformer that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a transformer) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other.

Fusion is achieved by an efficient cross-attention module, in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.

Source: CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Sensor Fusion	1	25.00%
Adversarial Robustness	1	25.00%
General Classification	1	25.00%
Image Classification	1	25.00%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
Cross-Attention Module	Attention Modules
Dense Connections	Feedforward Networks
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms

Categories

Add Remove

Vision Transformers

Image Models