DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2× faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git.
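To make the idea of spatial-attention dropout concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes tokens from two frames share one attention matrix and drops within-frame (spatial) entries uniformly at random with a fixed probability; the abstract describes the dropout as adaptive, and that selection rule is not reproduced here. The function name `spatial_attention_dropout`, its parameters, and the choice to always keep each token's attention to itself are assumptions made for illustration.

```python
import numpy as np

def spatial_attention_dropout(logits, frame_ids, drop_prob, rng):
    """Softmax attention with within-frame logits randomly dropped.

    Illustrative simplification: entries are dropped uniformly at random,
    whereas the paper describes an adaptive scheme.

    logits:    (N, N) raw attention scores for N tokens from the frames.
    frame_ids: (N,) integer frame index of each token.
    drop_prob: probability of dropping each within-frame (spatial) entry.
    rng:       numpy.random.Generator used to sample the drop mask.
    """
    same_frame = frame_ids[:, None] == frame_ids[None, :]
    # Only spatial entries are candidates. Self-attention is kept (an
    # assumption of this sketch) so each row retains a finite logit.
    candidates = same_frame & ~np.eye(len(frame_ids), dtype=bool)
    drop = candidates & (rng.random(logits.shape) < drop_prob)
    # Dropped entries get -inf so they receive zero weight after softmax.
    masked = np.where(drop, -np.inf, logits)
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

# Two frames with three tokens each and uniform scores.
rng = np.random.default_rng(0)
frame_ids = np.array([0, 0, 0, 1, 1, 1])
logits = np.zeros((6, 6))
cross_frame = frame_ids[:, None] != frame_ids[None, :]
attn = spatial_attention_dropout(logits, frame_ids, drop_prob=1.0, rng=rng)
temporal_share = (attn * cross_frame).sum(axis=1)
```

In this toy setting, dropping every spatial entry raises each token's cross-frame attention share from 0.5 to 0.75, which is the effect the abstract appeals to: with fewer spatial cues available, reconstruction has to lean on temporal correspondence.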

CVPR 2023

Results from the Paper


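| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Object Tracking | GOT-10k | DropMAE | Average Overlap | 75.9 | #6 |
| Visual Object Tracking | GOT-10k | DropMAE | Success Rate 0.5 | 86.8 | #3 |
| Visual Object Tracking | GOT-10k | DropMAE | Success Rate 0.75 | 72 | #8 |
| Visual Object Tracking | ITB | DropTrack | AUC | 0.65 | #1 |
| Visual Object Tracking | LaSOT | DropTrack | AUC | 71.8 | #11 |
| Visual Object Tracking | LaSOT | DropTrack | Normalized Precision | 81.8 | #4 |
| Visual Object Tracking | LaSOT | DropTrack | Precision | 78.1 | #8 |
| Visual Object Tracking | LaSOT-ext | DropTrack | AUC | 52.7 | #5 |
| Visual Object Tracking | LaSOT-ext | DropTrack | Precision | 60.2 | #2 |
| Visual Object Tracking | TNL2K | DropTrack | Precision | 57.9 | #2 |
| Visual Object Tracking | TNL2K | DropTrack | AUC | 0.569 | #7 |
| Visual Object Tracking | TrackingNet | DropTrack | Normalized Precision | 88.9 | #7 |
| Visual Object Tracking | TrackingNet | DropTrack | AUC | 0.841 | #1 |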
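
For readers unfamiliar with the GOT-10k metrics listed above, the sketch below shows how Average Overlap and Success Rate are commonly defined from per-frame IoU values. It is a simplified illustration under stated assumptions: boxes are given as (x, y, width, height), the helper names are hypothetical, and the official benchmark toolkit may differ in details such as per-sequence averaging. The IoU values in the usage lines are made up and are not results from the paper.

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(ious):
    """Mean IoU over all evaluated frames."""
    return float(np.mean(ious))

def success_rate(ious, threshold):
    """Fraction of frames whose IoU exceeds the threshold."""
    return float(np.mean(np.asarray(ious) > threshold))

# Made-up per-frame IoUs for a four-frame sequence.
ious = [0.9, 0.6, 0.4, 0.8]
ao = average_overlap(ious)        # 0.675
sr_50 = success_rate(ious, 0.5)   # 0.75
sr_75 = success_rate(ious, 0.75)  # 0.5
```

The GOT-10k rows in the table appear to report these quantities as percentages, so a value of 0.675 here would correspond to 67.5 on that scale.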

Methods