Cross-Attention Module

Introduced by Chen et al. in CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The Cross-Attention module is the attention module CrossViT uses to fuse multi-scale features. The CLS token of the large branch serves as the query token and interacts with the patch tokens of the small branch through attention. $f(\cdot)$ and $g(\cdot)$ are projections that align the dimensions of the two branches. The small branch follows the same procedure with the roles swapped: its CLS token queries the patch tokens of the large branch.
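
The description above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the names cross_attention, init_params, Wq, Wk and Wv are hypothetical, the weights are random, and layer normalization and multiple heads are omitted. It also assumes that the projected CLS token is concatenated with the small-branch patch tokens to form the keys and values, and that a residual connection is applied before $g(\cdot)$ projects the result back.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d_large, d_small, rng):
    # Hypothetical random weights, for illustration only.
    return {
        "f": rng.normal(size=(d_large, d_small)),   # f(.): large -> small dim
        "g": rng.normal(size=(d_small, d_large)),   # g(.): small -> large dim
        "Wq": rng.normal(size=(d_small, d_small)),
        "Wk": rng.normal(size=(d_small, d_small)),
        "Wv": rng.normal(size=(d_small, d_small)),
    }


def cross_attention(cls_large, patches_small, params):
    """Fuse the large-branch CLS token with small-branch patch tokens.

    cls_large:     array of shape (1, d_large)
    patches_small: array of shape (n, d_small)
    Returns the updated CLS token (1, d_large) and the attention
    weights (1, n + 1).
    """
    # Align the CLS token with the small branch's dimension.
    cls_proj = cls_large @ params["f"]                          # (1, d_small)
    # Assumption: keys/values cover the projected CLS plus the patches.
    tokens = np.concatenate([cls_proj, patches_small], axis=0)  # (n+1, d_small)

    q = cls_proj @ params["Wq"]   # only the CLS token acts as a query
    k = tokens @ params["Wk"]
    v = tokens @ params["Wv"]

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))              # (1, n+1)
    fused = cls_proj + attn @ v   # residual connection (assumed)

    # Project back to the large branch's dimension.
    return fused @ params["g"], attn


rng = np.random.default_rng(0)
params = init_params(d_large=8, d_small=4, rng=rng)
cls_large = rng.normal(size=(1, 8))
patches_small = rng.normal(size=(6, 4))
new_cls, attn = cross_attention(cls_large, patches_small, params)
print(new_cls.shape, attn.shape)
```

Because only the CLS token is used as a query, the attention map has a single row, so the cost grows linearly with the number of patch tokens rather than quadratically.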

Source: CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Tasks

Task	Papers	Share
Semantic Segmentation	10	5.41%
Object Detection	7	3.78%
Retrieval	6	3.24%
Image Classification	6	3.24%
Autonomous Driving	5	2.70%
Image Super-Resolution	4	2.16%
Super-Resolution	4	2.16%
Sentence	4	2.16%
Image Generation	3	1.62%

Components

Component	Type
Concatenated Skip Connection	Skip Connections
Softmax	Output Functions

Categories

Attention Modules