AerialFormer: Multi-resolution Transformer for Aerial Image Segmentation
Aerial image segmentation is semantic segmentation from a top-down perspective. It poses several challenges: a strong foreground-background imbalance, complex backgrounds, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we inherit the advantages of Transformers and propose AerialFormer, which unifies Transformers in the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) in the expanding path. AerialFormer is designed as a hierarchical structure in which the Transformer encoder outputs multi-scale features and the MD-CNN decoder aggregates information across those scales. It thus takes both local and global context into consideration to produce powerful representations and high-resolution segmentation. We benchmark AerialFormer on three common datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that AerialFormer outperforms previous state-of-the-art methods. Our source code will be made publicly available upon acceptance.
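To make the idea of a multi-dilated convolution concrete, the following is a minimal 1D sketch in plain Python. It is not the AerialFormer decoder: the abstract does not specify how the MD-CNN blocks are built, so the parallel-branch layout, the dilation rates `(1, 2, 3)`, and the helper names `dilated_conv1d`, `multi_dilated_features`, and `receptive_field` are all assumptions made for illustration.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Same-length 1D cross-correlation with zero padding.

    Kernel taps are spaced `dilation` samples apart, so a larger
    dilation covers a wider context with the same number of weights.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_dilated_features(signal: Sequence[float], kernel: Sequence[float],
                           dilations: Sequence[int] = (1, 2, 3)) -> List[List[float]]:
    """Hypothetical multi-dilated block: one output channel per dilation rate."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span, in input samples, covered by a single dilated kernel."""
    return (kernel_size - 1) * dilation + 1


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    for d, channel in zip((1, 2, 3), multi_dilated_features(impulse, [1, 1, 1])):
        print(f"dilation={d} receptive_field={receptive_field(3, d)} -> {channel}")
```

With a three-tap kernel, dilation 1 sees 3 neighbouring samples while dilation 3 spans 7, which is the sense in which stacking several rates mixes local and wider context at little extra cost.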
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Semantic Segmentation | iSAID | AerialFormer-B | mIoU | 69.3 | #3 |
Semantic Segmentation | iSAID | AerialFormer-T | mIoU | 67.5 | #9 |
Semantic Segmentation | iSAID | AerialFormer-S | mIoU | 68.4 | #5 |
Semantic Segmentation | ISPRS Potsdam | AerialFormer-B | Overall Accuracy | 93.9 | #1 |
Semantic Segmentation | ISPRS Potsdam | AerialFormer-B | Mean F1 | 94.1 | #1 |
Semantic Segmentation | ISPRS Potsdam | AerialFormer-B | Mean IoU | 89.1 | #1 |
Semantic Segmentation | LoveDA | AerialFormer-B | Category mIoU | 54.1 | #4 |
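The three metric families in the table (overall accuracy, mean F1, and mean IoU) can all be derived from a per-class confusion matrix. The sketch below uses the standard textbook definitions in plain Python. The function names are hypothetical, and it does not reproduce benchmark-specific conventions such as which classes a given dataset ignores, so it should not be read as the evaluation code behind the reported numbers.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_scores(cm: List[List[int]]) -> Dict[str, float]:
    """Overall accuracy, mean IoU, and mean F1 from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(n))
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class otherwise
        if tp + fp + fn == 0:
            continue  # class absent from both labels and predictions
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return {
        "overall_accuracy": correct / total,
        "mean_iou": sum(ious) / len(ious),
        "mean_f1": sum(f1s) / len(f1s),
    }


if __name__ == "__main__":
    # Flattened label maps for a toy four-pixel image with two classes.
    cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
    print(segmentation_scores(cm))
```

Because per-class F1 equals 2·IoU / (1 + IoU), it is never below IoU for the same class, so a mean F1 that exceeds the mean IoU on the same dataset is expected.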