BAEFormer: Bi-Directional and Early Interaction Transformers for Bird's Eye View Semantic Segmentation

Bird's Eye View (BEV) semantic segmentation is a critical task in autonomous driving. However, existing Transformer-based methods struggle to transform Perspective View (PV) features to BEV because their interaction mechanisms are unidirectional and occur late (posterior) in the pipeline. To address this issue, we propose a novel Bi-directional and Early Interaction Transformers framework named BAEFormer, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism. Moreover, we find that the resolution of the image feature maps used in the cross-attention module has only a limited effect on final performance. Based on this observation, we propose to enlarge the input images and downsample the multi-view image features for cross-interaction, further improving accuracy while keeping the amount of computation controllable. Our proposed method for BEV semantic segmentation achieves state-of-the-art performance at real-time inference speed on the nuScenes dataset, i.e., 38.9 mIoU at 45 FPS on a single A100 GPU.
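
The page carries no reference implementation, but the bi-directional cross-attention idea can be sketched in a few lines. The PyTorch snippet below is a hypothetical illustration under stated assumptions, not the authors' code: the module name, tensor shapes, and the choice of `nn.MultiheadAttention` are all assumptions.

```python
import torch
import torch.nn as nn

class BiDirectionalCrossAttention(nn.Module):
    """Minimal sketch: BEV queries and PV image tokens attend to each other."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # BEV queries attend to the (possibly downsampled) multi-view image tokens ...
        self.bev_from_pv = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... and the image tokens attend back to the BEV queries.
        self.pv_from_bev = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, bev: torch.Tensor, pv: torch.Tensor):
        # bev: (B, H*W, C) BEV query tokens; pv: (B, N, C) flattened PV tokens.
        bev_upd, _ = self.bev_from_pv(query=bev, key=pv, value=pv)
        pv_upd, _ = self.pv_from_bev(query=pv, key=bev, value=bev)
        # Residual connections keep both streams' original content.
        return bev + bev_upd, pv + pv_upd

# Hypothetical usage: a 50x50 BEV grid and 6 camera views whose feature maps
# were downsampled before cross-interaction, per the abstract's observation
# that feature-map resolution in cross-attention barely affects accuracy.
layer = BiDirectionalCrossAttention(dim=128, num_heads=4)
bev = torch.randn(2, 50 * 50, 128)
pv = torch.randn(2, 6 * 14 * 25, 128)
bev, pv = layer(bev, pv)
```

Updating both streams in each block is what distinguishes this from the usual unidirectional PV-to-BEV lifting; the early-interaction aspect would correspond to applying such blocks from the first stages of the backbone rather than only after full PV encoding.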

Datasets

nuScenes

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Bird's-Eye View Semantic Segmentation | nuScenes | BAEFormer | IoU veh - 224x480 - No vis filter - 100x100 at 0.5 | 36.0 | #5 |
| Bird's-Eye View Semantic Segmentation | nuScenes | BAEFormer | IoU veh - 448x800 - No vis filter - 100x100 at 0.5 | 37.8 | #5 |
| Bird's-Eye View Semantic Segmentation | nuScenes | BAEFormer | IoU veh - 224x480 - Vis filter. - 100x100 at 0.5 | 38.9 | #6 |
| Bird's-Eye View Semantic Segmentation | nuScenes | BAEFormer | IoU veh - 448x800 - Vis filter. - 100x100 at 0.5 | 41.0 | #5 |
