The Cross-Attention module is an attention module used in CrossViT to fuse multi-scale features. The CLS token of the large branch (circle) serves as a query token that interacts with the patch tokens from the small branch through attention. $f\left(\cdot\right)$ and $g\left(\cdot\right)$ are projections that align dimensions. The small branch follows the same procedure with the roles swapped: its CLS token queries the patch tokens of the large branch.
Source: CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
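
The data flow described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the CrossViT implementation: the function name `cross_attention`, the `params` dictionary, the bias-free linear maps standing in for $f$ and $g$, and the toy dimensions are all assumptions made for the example. It also assumes that the projected CLS token is concatenated with the small-branch patch tokens to form keys and values and that a residual connection is applied before $g$; layer normalization and multiple heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(cls_large, patches_small, params):
    """Fuse the large-branch CLS token with small-branch patch tokens.

    cls_large:     (1, d_large) CLS token of the large branch
    patches_small: (n, d_small) patch tokens of the small branch
    Returns the updated CLS token (1, d_large) and the attention weights.
    """
    # f(.): project the CLS token into the small branch's dimension.
    cls_proj = cls_large @ params["f"]                    # (1, d_small)

    # Assumption: keys/values come from [projected CLS ; small patches].
    tokens = np.concatenate([cls_proj, patches_small], axis=0)

    q = cls_proj @ params["Wq"]                           # only CLS is a query
    k = tokens @ params["Wk"]
    v = tokens @ params["Wv"]

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))        # (1, n + 1)
    fused = cls_proj + attn @ v                           # residual connection

    # g(.): project back to the large branch's dimension.
    return fused @ params["g"], attn


rng = np.random.default_rng(0)
d_large, d_small, n = 8, 4, 6                             # toy sizes
params = {
    "f": rng.normal(size=(d_large, d_small)),
    "g": rng.normal(size=(d_small, d_large)),
    "Wq": rng.normal(size=(d_small, d_small)),
    "Wk": rng.normal(size=(d_small, d_small)),
    "Wv": rng.normal(size=(d_small, d_small)),
}
cls_large = rng.normal(size=(1, d_large))
patches_small = rng.normal(size=(n, d_small))

out, attn = cross_attention(cls_large, patches_small, params)
print(out.shape, attn.shape)
```

Because only the CLS token acts as a query, the attention map has a single row, so the cost grows linearly with the number of patch tokens rather than quadratically. The small branch would call the same function with the roles of the two branches exchanged.
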
Task | Papers | Share |
---|---|---|
Semantic Segmentation | 10 | 5.41% |
Object Detection | 7 | 3.78% |
Retrieval | 6 | 3.24% |
Image Classification | 6 | 3.24% |
Autonomous Driving | 5 | 2.70% |
Image Super-Resolution | 4 | 2.16% |
Super-Resolution | 4 | 2.16% |
Sentence | 4 | 2.16% |
Image Generation | 3 | 1.62% |