MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a single MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it to come before the self-attention operation. The mobile convolution block not only enhances the network's representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large-resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus across windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is publicly available.
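To make the block design concrete, below is a minimal sketch of a single MOAT block in PyTorch. It only illustrates the idea described above (an inverted-residual mobile convolution placed before self-attention, with the Transformer MLP removed); the normalization choices, expansion ratio, and head count are assumptions and do not reproduce the authors' reference implementation.

```python
# Minimal MOAT-block sketch (illustrative only, not the official code).
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual (mobile convolution) block: expand -> depthwise -> project."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),  # depthwise 3x3
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return x + self.block(x)


class MOATBlock(nn.Module):
    """Transformer block whose MLP is replaced by an MBConv that runs
    *before* the self-attention, as described in the abstract."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.mbconv = MBConv(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.mbconv(x)                     # local mixing (replaces the MLP)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (batch, h*w, channels)
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)       # global self-attention
        tokens = tokens + attn_out             # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = MOATBlock(dim=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

In a downsampling variant (not shown here), the depthwise convolution inside the MBConv would use stride 2, which is one natural way to obtain the better downsampled features mentioned in the abstract.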


Results from the Paper


Numbers in parentheses are the global ranks on the corresponding benchmark.

Semantic Segmentation, ADE20K validation (mIoU, single-scale)
  tiny-MOAT-0 (IN-1K pretraining): 41.2 mIoU (#208), 6M params (#63)
  tiny-MOAT-1 (IN-1K pretraining): 43.1 mIoU (#204), 8M params (#61)
  tiny-MOAT-2 (IN-1K pretraining): 44.9 mIoU (#190), 13M params (#59)
  tiny-MOAT-3 (IN-1K pretraining): 47.5 mIoU (#155), 24M params (#55)
  MOAT-2 (IN-22K pretraining): 54.7 mIoU (#49), 81M params (#35)
  MOAT-3 (IN-22K pretraining): 56.5 mIoU (#33), 198M params (#21)
  MOAT-4 (IN-22K pretraining): 57.6 mIoU (#24), 496M params (#10)

Instance Segmentation, COCO minival (mask AP, single-scale)
  tiny-MOAT-0 (IN-1K pretraining): 43.3 (#58)
  tiny-MOAT-1 (IN-1K pretraining): 44.6 (#49)
  tiny-MOAT-2 (IN-1K pretraining): 45.0 (#47)
  tiny-MOAT-3 (IN-1K pretraining): 47.0 (#38)
  MOAT-0 (IN-1K pretraining): 47.4 (#36)
  MOAT-1 (IN-1K pretraining): 49.0 (#27)
  MOAT-2 (IN-22K pretraining): 49.3 (#26)
  MOAT-3 (IN-22K pretraining): 50.3 (#23)

Object Detection, COCO minival (box AP, single-scale)
  tiny-MOAT-0 (IN-1K pretraining): 50.5 (#74)
  tiny-MOAT-1 (IN-1K pretraining): 51.9 (#67)
  tiny-MOAT-2 (IN-1K pretraining): 53.0 (#61)
  tiny-MOAT-3 (IN-1K pretraining): 55.2 (#48)
  MOAT-0 (IN-1K pretraining): 55.9 (#45)
  MOAT-1 (IN-1K pretraining): 57.7 (#36)
  MOAT-2 (IN-22K pretraining): 58.5 (#33)
  MOAT-3 (IN-22K pretraining): 59.2 (#28)

Image Classification, ImageNet (Top-1 accuracy)
  MOAT-0 (1K only): 83.3% (#403), 27.8M params (#627), 5.7 GFLOPs (#237)
  MOAT-3 (1K only): 86.7% (#126), 190M params (#890), 271 GFLOPs (#473)
  MOAT-4 (22K+1K): 89.1% (#33), 483.2M params (#940), 648.5 GFLOPs (#485)

Image Classification, ImageNet V2 (Top-1 accuracy, IN-22K pretraining)
  MOAT-1: 78.4 (#12)
  MOAT-2: 79.3 (#11)
  MOAT-3: 80.6 (#10)
  MOAT-4: 81.5 (#8)

Object Detection, MS COCO (box AP)
  MOAT-2: 58.5 (#2)
  MOAT-3 (22K+1K): 59.2 (#1)

Methods