SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition

1 May 2021  ·  Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Zhiwu Lu, Jun He, Xiaoyong Du ·

Point cloud-based large scale place recognition is fundamental for many applications such as Simultaneous Localization and Mapping (SLAM). Although many models have been proposed and achieve good performance by learning short-range local features, long-range contextual properties have often been neglected. Moreover, model size has become a bottleneck for wide deployment. To overcome these challenges, we propose a super light-weight network model termed SVT-Net for large scale place recognition. Specifically, on top of highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features. Combining ASVT and CSVT, SVT-Net achieves state-of-the-art results on benchmark datasets in terms of both accuracy and speed with a super-light model size (0.9M parameters). Meanwhile, two simplified versions of SVT-Net are introduced, which also achieve state-of-the-art results while further reducing the model size to 0.8M and 0.4M respectively.
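The two transformer branches described above can be illustrated with a rough sketch: ASVT attends among individual voxels ("atoms") so each voxel aggregates long-range context directly, while CSVT first pools voxels into a small set of cluster descriptors, attends among clusters, and scatters the result back. The code below is a minimal NumPy illustration of that idea under these assumptions, not the authors' implementation; all weight matrices and the soft-assignment step are hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, wq, wk, wv):
    """Scaled dot-product self-attention over a set of feature vectors.

    feats: (N, d) array of per-voxel (or per-cluster) features.
    """
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
N, d = 64, 16                       # number of occupied voxels, feature dim
feats = rng.standard_normal((N, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

# ASVT-style: attention among all individual voxels ("atoms"),
# so every voxel can aggregate long-range context in one step.
asvt_out = self_attention(feats, wq, wk, wv)

# CSVT-style: soft-assign voxels to K cluster descriptors, attend
# among the K clusters, then broadcast the context back to voxels.
K = 8
assign = softmax(feats @ rng.standard_normal((d, K)), axis=-1)  # (N, K)
clusters = assign.T @ feats                                     # (K, d)
cluster_out = self_attention(clusters, wq, wk, wv)              # (K, d)
csvt_out = assign @ cluster_out                                 # (N, d)

print(asvt_out.shape, csvt_out.shape)  # (64, 16) (64, 16)
```

Note the cost trade-off this sketches: atom-level attention is quadratic in the number of voxels, while cluster-level attention is quadratic only in K, which is one way a model of this kind can stay small and fast.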

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Place Recognition | Oxford RobotCar Dataset | SVT-Net | AR@1 | 93.7 | #2 |
| 3D Place Recognition | Oxford RobotCar Dataset | SVT-Net | AR@1% | 97.8 | #4 |
| Point Cloud Retrieval | Oxford RobotCar (LiDAR 4096 points) | SVT-Net (refined) | recall@top1% | 98.4 | #6 |
| Point Cloud Retrieval | Oxford RobotCar (LiDAR 4096 points) | SVT-Net (refined) | recall@top1 | 94.7 | #6 |
| Point Cloud Retrieval | Oxford RobotCar (LiDAR 4096 points) | SVT-Net (baseline) | recall@top1% | 97.8 | #12 |
| Point Cloud Retrieval | Oxford RobotCar (LiDAR 4096 points) | SVT-Net (baseline) | recall@top1 | 93.7 | #10 |
