ODTrack: Online Dense Temporal Token Learning for Visual Tracking
Online contextual reasoning and association across consecutive video frames are critical for perceiving instances in visual tracking. However, most current top-performing trackers rely on sparse temporal relationships between reference and search frames in an offline mode. Consequently, they can only interact independently within each image pair and establish limited temporal correlations. To alleviate this problem, we propose a simple, flexible, and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames through online token propagation. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This solution brings two benefits: 1) the purified token sequences can serve as prompts for inference in the next video frame, so that past information guides future inference; 2) the iterative propagation of token sequences avoids complex online update strategies, yielding more efficient model representation and computation. ODTrack achieves new state-of-the-art (SOTA) performance on seven benchmarks while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.
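The online token propagation described in the abstract can be illustrated with a toy loop. The following is a minimal sketch under stated assumptions, not the ODTrack implementation: features are plain Python lists, a single dot-product attention step stands in for the real network, and the names `attend` and `track_video` are hypothetical. It only shows the control flow in which the token produced for one frame becomes the prompt for the next.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, tokens: Sequence[Vector]) -> Vector:
    """Single-head scaled dot-product attention of one query over tokens.

    Toy stand-in (assumption) for whatever network updates the token.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]


def track_video(
    reference_tokens: Sequence[Vector],
    frames: Sequence[Sequence[Vector]],
    init_token: Vector,
) -> List[Vector]:
    """Propagate one temporal token through the frames, in order."""
    token = init_token
    history: List[Vector] = []
    for search_tokens in frames:
        # The token from the previous frame acts as a prompt: it attends
        # jointly over the reference features and the current search features.
        token = attend(token, list(reference_tokens) + list(search_tokens))
        history.append(token)
    return history
```

Because the only state carried between frames is `token`, no separate template-update step appears in the loop, which mirrors the second benefit listed in the abstract.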
Datasets
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Visual Object Tracking | GOT-10k | ODTrack-B | Average Overlap | 77.0 | # 4 |
Visual Object Tracking | GOT-10k | ODTrack-L | Average Overlap | 78.2 | # 3 |
Visual Object Tracking | LaSOT | ODTrack-B | AUC | 73.2 | # 4 |
Visual Object Tracking | LaSOT | ODTrack-L | AUC | 74.0 | # 1 |
Visual Object Tracking | LaSOT-ext | ODTrack-B | AUC | 52.4 | # 6 |
Visual Object Tracking | LaSOT-ext | ODTrack-L | AUC | 53.9 | # 2 |
Visual Object Tracking | OTB-2015 | ODTrack-B | AUC | 0.723 | # 2 |
Visual Object Tracking | OTB-2015 | ODTrack-L | AUC | 0.724 | # 1 |
Visual Object Tracking | TNL2K | ODTrack-B | AUC | 60.9 | # 3 |
Visual Object Tracking | TNL2K | ODTrack-L | AUC | 61.7 | # 1 |
Visual Object Tracking | TrackingNet | ODTrack-B | Accuracy | 85.1 | # 7 |
Visual Object Tracking | TrackingNet | ODTrack-L | Accuracy | 86.1 | # 1 |
Semi-Supervised Video Object Segmentation | VOT2020 | ODTrack-B | EAO | 0.581 | # 7 |
Semi-Supervised Video Object Segmentation | VOT2020 | ODTrack-L | EAO | 0.605 | # 3 |
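
For readers unfamiliar with the overlap-based metrics in the table, the sketch below shows how Average Overlap and a success-plot AUC are commonly computed from per-frame boxes. It rests on assumptions not stated on this page: boxes are `(x, y, w, h)` tuples, success is counted with a strict `>` comparison, and the AUC averages success rates over 21 evenly spaced thresholds in [0, 1]. Official benchmark toolkits may differ in these details, and the function names are illustrative only.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_auc(
    preds: Sequence[Box], gts: Sequence[Box], num_thresholds: int = 21
) -> float:
    """Area under the success plot, approximated as the mean success rate.

    The threshold count and strict comparison are assumptions.
    """
    overlaps: List[float] = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [
        sum(1 for o in overlaps if o > t) / len(overlaps) for t in thresholds
    ]
    return sum(rates) / len(rates)
```

Both functions return fractions in [0, 1]. Most rows in the table appear to report percentages, while the OTB-2015 rows appear to use the fractional scale.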