Active Token Mixer

11 Mar 2022 · Guoqiang Wei, Zhizheng Zhang, Cuiling Lan, Yan Lu, Zhibo Chen

The three dominant network families, i.e., CNNs, Transformers, and MLPs, differ from each other mainly in how they fuse spatial contextual information, which places the design of more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token mixer, dubbed Active Token Mixer (ATM), that actively incorporates flexible contextual information distributed across different channels of other tokens into a given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at the channel level. In this way, the spatial range of token-mixing is expanded to a global scope with limited computational complexity, and the token-mixing process itself is reformed. Taking ATM as the primary operator, we assemble ATMs into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses SOTA vision backbones from different families by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction. Code is available at https://github.com/microsoft/ActiveMLP.
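
To make the "predict where, then fuse at the channel level" idea concrete, below is a minimal PyTorch sketch based only on the abstract, shown for a 1-D row of tokens. It is not the authors' implementation: the class name ATMSketch, the offset_head/fuse layers, and the hard rounding of offsets are all illustrative assumptions (the actual method would need a differentiable sampling scheme, and it operates over full 2-D feature maps); see the official repository at https://github.com/microsoft/ActiveMLP for the real operator.

```python
# Hypothetical sketch of an ATM-style token mixer (not the authors' code).
# For each channel, predict *where* to take context from (a per-channel
# spatial offset), gather those features, and fuse them with the query token.
import torch
import torch.nn as nn


class ATMSketch(nn.Module):
    """Per-channel gather from predicted offsets, then channel-level fusion."""

    def __init__(self, dim: int, max_offset: int = 3):
        super().__init__()
        self.max_offset = max_offset
        # "Where": one scalar offset per channel, predicted from the query token.
        self.offset_head = nn.Linear(dim, dim)
        # "How": channel-level fusion of the query token and gathered context.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, W, C) -- a single row of W tokens with C channels.
        B, W, C = x.shape
        # Continuous per-channel offsets in [-max_offset, max_offset].
        offsets = torch.tanh(self.offset_head(x)) * self.max_offset  # (B, W, C)
        base = torch.arange(W, device=x.device).view(1, W, 1)
        # Hard rounding for simplicity; it is non-differentiable w.r.t. the
        # offsets, so a real implementation would interpolate instead.
        idx = (base + offsets.round().long()).clamp(0, W - 1)        # (B, W, C)
        # Per-channel gather: context[b, w, c] = x[b, idx[b, w, c], c],
        # so each channel can pull context from a different position.
        context = torch.gather(x, dim=1, index=idx)
        # Fuse the query token with its gathered context at the channel level.
        return self.fuse(torch.cat([x, context], dim=-1))


# Usage: mix a batch of 2 rows of 49 tokens with 64 channels each.
atm = ATMSketch(dim=64)
out = atm(torch.randn(2, 49, 64))  # -> (2, 49, 64)
```

The design point this sketch tries to capture is that the offsets are per channel, so different channels of the same query token can reach different spatial locations; because each token gathers a fixed number of contexts rather than attending to all others, the cost stays linear in the number of tokens while the reachable range is, in principle, global.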

Results from the Paper

| Task                  | Dataset      | Model                            | Metric           | Value | Global Rank |
|-----------------------|--------------|----------------------------------|------------------|-------|-------------|
| Semantic Segmentation | ADE20K       | ActiveMLP-L (UperNet)            | Validation mIoU  | 51.1  | #93         |
| Semantic Segmentation | ADE20K       | ActiveMLP-L (UperNet)            | Params (M)       | 108   | #29         |
| Object Detection      | COCO minival | ActiveMLP-B (Cascade Mask R-CNN) | box AP           | 52.3  | #64         |
| Image Classification  | ImageNet     | ActiveMLP-L                      | Top 1 Accuracy   | 84.8% | #270        |
| Image Classification  | ImageNet     | ActiveMLP-L                      | Number of params | 76.4M | #801        |
| Image Classification  | ImageNet     | ActiveMLP-L                      | GFLOPs           | 36.4  | #404        |
| Image Classification  | ImageNet     | ActiveMLP-T                      | Top 1 Accuracy   | 82%   | #530        |
| Image Classification  | ImageNet     | ActiveMLP-T                      | Number of params | 27.2M | #622        |
| Image Classification  | ImageNet     | ActiveMLP-T                      | GFLOPs           | 4     | #191        |
