DaViT: Dual Attention Vision Transformers

7 Apr 2022  ·  Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. We approach the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain linear complexity for the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT achieves state-of-the-art performance on four different tasks with efficient computation. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
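To make the channel-token idea concrete, below is a minimal PyTorch sketch of channel group attention as the abstract describes it: channels act as tokens, spatial positions act as each token's features, and the channels are split into groups so the cost stays linear in the number of spatial positions. This is an illustration of the idea, not the reference implementation from the linked repository; the class name, the `groups` parameter, and the scaling choice are assumptions.

```python
# Sketch of DaViT-style channel group attention (assumption: written from the
# abstract's description, not copied from https://github.com/dingmyu/davit).
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Self-attention along the channel axis.

    Channels are the tokens; the spatial dimension is each token's feature
    dimension. Splitting channels into `groups` keeps the attention map at
    (C/groups x C/groups) per group, linear in the number of spatial positions.
    """

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.scale = (dim // groups) ** -0.5  # assumed scaling choice
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N spatial positions and C channels.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        # -> (3, B, groups, C//groups, N): channels become the token axis,
        # spatial positions the feature axis.
        qkv = qkv.permute(2, 0, 3, 4, 1)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Channel-to-channel attention map; every spatial position contributes
        # to each score, which is what makes this attention global.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = attn @ v                                  # (B, groups, C//groups, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
        return self.proj(out)


# Smoke test: a 14x14 feature map with 96 channels.
x = torch.randn(2, 196, 96)
print(ChannelGroupAttention(dim=96)(x).shape)  # torch.Size([2, 196, 96])
```

The spatial branch is complementary: self-attention over spatial tokens grouped into local windows, so that it also stays linear in image size while the channel branch supplies the global interactions.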

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DaViT-T | Validation mIoU | 46.3 | #169 |
| Semantic Segmentation | ADE20K | DaViT-B | Validation mIoU | 49.4 | #125 |
| Semantic Segmentation | ADE20K val | DaViT-B (UperNet) | mIoU | 46.3 | #66 |
| Semantic Segmentation | ADE20K val | DaViT-S (UperNet) | mIoU | 48.8 | #57 |
| Image Classification | ImageNet | DaViT-T | Top 1 Accuracy | 82.8% | #453 |
| Image Classification | ImageNet | DaViT-T | Number of params | 28.3M | #639 |
| Image Classification | ImageNet | DaViT-S | Number of params | 49.7M | #722 |
| Image Classification | ImageNet | DaViT-H | Top 1 Accuracy | 90.2% | #13 |
| Image Classification | ImageNet | DaViT-H | Number of params | 362M | #925 |
| Image Classification | ImageNet | DaViT-H | GFLOPs | 334 | #477 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Top 1 Accuracy | 87.5% | #86 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Number of params | 196.8M | #896 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | GFLOPs | 103 | #451 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Top 1 Accuracy | 86.9% | #115 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Number of params | 87.9M | #830 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | GFLOPs | 46.4 | #417 |
| Image Classification | ImageNet | DaViT-B | Top 1 Accuracy | 84.6% | #288 |
| Image Classification | ImageNet | DaViT-B | Number of params | 87.9M | #830 |
| Image Classification | ImageNet | DaViT-B | GFLOPs | 15.5 | #341 |
| Image Classification | ImageNet | DaViT-S | GFLOPs | 8.8 | – |
| Image Classification | ImageNet | DaViT-S | Top 1 Accuracy | 84.2% | – |
| Image Classification | ImageNet | DaViT-T | GFLOPs | 4.5 | – |
| Image Classification | ImageNet | DaViT-G | Top 1 Accuracy | 90.4% | #12 |
| Image Classification | ImageNet | DaViT-G | Number of params | 1437M | #958 |
| Image Classification | ImageNet | DaViT-G | GFLOPs | 1038 | #489 |
| Object Detection | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | box AP | 49.9 | #1 |
| Instance Segmentation | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | mask AP | 44.3 | #1 |
