Visual Prompt Multi-Modal Tracking

CVPR 2023  ·  Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu

Visible-modal object tracking has given rise to a series of downstream multi-modal tracking branches. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this approach is suboptimal due to the scarcity of downstream data, limited transferability, and other factors. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT offers a better way to exploit the knowledge of the RGB-based model pre-trained at scale, while introducing only a few trainable parameters (less than 1% of the model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
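The core idea in the abstract is to freeze the RGB-pre-trained backbone and train only a small set of modal-relevant prompt parameters that fuse the auxiliary modality (depth, thermal, or event) into the RGB stream. The sketch below illustrates one way such prompt injection can be wired up in PyTorch; it is not the authors' implementation, and the `PromptBlock` bottleneck design, the `blocks` attribute on the backbone, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of prompt-based adaptation for multi-modal tracking,
# in the spirit of ViPT but not the official implementation.
import torch
import torch.nn as nn

class PromptBlock(nn.Module):
    """Tiny trainable module that fuses auxiliary-modality tokens
    (depth / thermal / event) into the frozen RGB token stream."""
    def __init__(self, dim, hidden=8):
        super().__init__()
        # A bottleneck keeps the prompt parameters well under 1% of the backbone.
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, rgb_tokens, aux_tokens):
        # Learn a modality-relevant prompt from the auxiliary input and
        # add it to the RGB tokens as a residual.
        prompt = self.up(self.act(self.down(aux_tokens)))
        return rgb_tokens + prompt

class PromptedTracker(nn.Module):
    """Frozen RGB backbone + per-layer trainable prompt blocks."""
    def __init__(self, rgb_backbone, dim=768, num_layers=12):
        super().__init__()
        self.backbone = rgb_backbone
        # Freeze all pre-trained RGB parameters.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Only these prompt blocks are trained on the downstream task.
        self.prompts = nn.ModuleList(PromptBlock(dim) for _ in range(num_layers))

    def forward(self, rgb_tokens, aux_tokens):
        # Assumes the backbone exposes its transformer layers as `blocks`
        # (as timm-style ViTs do); adapt to the actual backbone as needed.
        for block, prompt in zip(self.backbone.blocks, self.prompts):
            rgb_tokens = prompt(rgb_tokens, aux_tokens)
            rgb_tokens = block(rgb_tokens)
        return rgb_tokens
```

During training, only `model.prompts` would be passed to the optimizer, which is what keeps the trainable-parameter count to a small fraction of the full model.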


Results from the Paper


| Task           | Dataset | Model | Metric    | Value | Global Rank |
|----------------|---------|-------|-----------|-------|-------------|
| RGB-T Tracking | LasHeR  | ViPT  | Precision | 65.1  | #10         |
| RGB-T Tracking | LasHeR  | ViPT  | Success   | 52.5  | #10         |
| RGB-T Tracking | RGBT234 | ViPT  | Precision | 83.5  | #12         |
| RGB-T Tracking | RGBT234 | ViPT  | Success   | 61.7  | #11         |

Methods


No methods listed for this paper.