MAST: A Memory-Augmented Self-supervised Tracker

CVPR 2020  ·  Zihang Lai, Erika Lu, Weidi Xie

Recent interest in self-supervised dense tracking has yielded rapid progress, but performance still remains far from that of supervised methods. We propose a dense tracking model trained on videos without any annotations that surpasses previous self-supervised methods on existing benchmarks by a significant margin (+15%), and achieves performance comparable to supervised methods. In this paper, we first reassess the traditional choices used for self-supervised training and reconstruction loss by conducting thorough experiments that finally elucidate the optimal choices. Second, we further improve on existing methods by augmenting our architecture with a crucial memory component. Third, we benchmark on large-scale semi-supervised video object segmentation (a.k.a. dense tracking), and propose a new metric: generalizability. Our first two contributions yield a self-supervised network that for the first time is competitive with supervised methods on standard evaluation metrics of dense tracking. When measuring generalizability, we show self-supervised approaches are actually superior to the majority of supervised methods. We believe this new generalizability metric can better capture the real-world use cases for dense tracking, and will spur new interest in this research direction.
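To make the training signal concrete, the sketch below illustrates the general mechanism the abstract describes: self-supervised reconstruction, where a query frame's pixel labels (e.g., color channels) are reconstructed as an attention-weighted copy from a memory bank of reference frames. This is a minimal illustration, not the paper's actual implementation; the function names, tensor shapes, and the use of full (rather than locally restricted) attention are assumptions for clarity.

```python
# Minimal sketch of attention-based reconstruction from a memory bank,
# the generic mechanism behind self-supervised dense tracking.
# Shapes and names are illustrative, not MAST's actual code.
import torch
import torch.nn.functional as F

def reconstruct_from_memory(query_feat, memory_feats, memory_labels,
                            temperature=0.07):
    """Reconstruct per-pixel labels of the query frame as an
    attention-weighted average over all memory-frame pixels.

    query_feat:    (C, H, W)    features of the frame to reconstruct
    memory_feats:  (T, C, H, W) features of T past (reference) frames
    memory_labels: (T, D, H, W) labels to copy (e.g., D=2 color channels)
    """
    C, H, W = query_feat.shape
    T = memory_feats.shape[0]
    q = query_feat.reshape(C, H * W)                              # (C, N)
    k = memory_feats.permute(1, 0, 2, 3).reshape(C, T * H * W)    # (C, T*N)
    v = memory_labels.permute(1, 0, 2, 3).reshape(-1, T * H * W)  # (D, T*N)

    # Soft attention: each query pixel attends to every memory pixel.
    affinity = torch.einsum('cn,cm->nm', q, k) / temperature      # (N, T*N)
    weights = F.softmax(affinity, dim=1)
    recon = torch.einsum('nm,dm->dn', weights, v)                 # (D, N)
    return recon.reshape(-1, H, W)

def reconstruction_loss(query_feat, memory_feats, memory_labels,
                        target_labels):
    """Photometric reconstruction loss: no annotations needed, since the
    target is simply the query frame's own labels (e.g., its colors)."""
    pred = reconstruct_from_memory(query_feat, memory_feats, memory_labels)
    return F.smooth_l1_loss(pred, target_labels)
```

At test time, the same attention weights can propagate a first-frame segmentation mask instead of color, which is what turns a model trained purely on reconstruction into a dense tracker; keeping multiple past frames in the memory bank is what the memory component adds over single-reference methods.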

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
Unsupervised Video Object Segmentation | DAVIS 2017 (val) | MAST | J&F | 65.5 | #4
Unsupervised Video Object Segmentation | DAVIS 2017 (val) | MAST | Jaccard (Mean) | 63.3 | #4
Unsupervised Video Object Segmentation | DAVIS 2017 (val) | MAST | Jaccard (Recall) | 73.2 | #2
Unsupervised Video Object Segmentation | DAVIS 2017 (val) | MAST | F-measure (Mean) | 67.6 | #5
Unsupervised Video Object Segmentation | DAVIS 2017 (val) | MAST | F-measure (Recall) | 77.7 | #1
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MAST | J&F | 65.5 | #66
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MAST | Jaccard (Mean) | 63.3 | #63
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MAST | Jaccard (Recall) | 73.2 | #17
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MAST | F-measure (Mean) | 67.6 | #67
Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | MAST | F-measure (Recall) | 77.7 | #15

Methods


No methods listed for this paper.