Learning Unsupervised Video Object Segmentation Through Visual Attention

This paper conducts a systematic study on the role of visual attention in Unsupervised Video Object Segmentation (UVOS) tasks. By carefully annotating three popular video segmentation datasets (DAVIS, YouTube-Objects and SegTrack V2) with dynamic eye-tracking data in the UVOS setting, we quantitatively verify, for the first time, the high consistency of visual attention behavior among human observers, and find a strong correlation between human attention and explicit primary object judgements during dynamic, task-driven viewing. These observations provide insight into the underlying rationale behind UVOS. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution has three major merits: 1) modular training that avoids expensive video segmentation annotations, using more affordable dynamic fixation data to train the initial video attention module and existing fixation-segmentation paired static image data to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance in comparison with state-of-the-art methods.
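The decoupling described in the abstract can be pictured as two functions chained together: a spatiotemporal stage that sees the whole clip and produces one attention map per frame, and a spatial stage that turns each frame and its attention map into a mask. The sketch below only illustrates that interface. The function names, the median-based attention heuristic, and the fixed threshold are all hypothetical stand-ins chosen for brevity; the paper's DVAP and AGOS modules are learned networks and are not reproduced here.

```python
import numpy as np


def predict_attention(frames):
    """Toy stand-in for the DVAP stage: one attention map in [0, 1] per frame.

    Hypothetical heuristic, NOT the paper's learned model: attention is the
    absolute deviation of each frame from the temporal median of the clip.
    """
    frames = np.asarray(frames, dtype=float)      # (T, H, W) grayscale clip
    background = np.median(frames, axis=0)        # uses the whole clip (spatiotemporal)
    deviation = np.abs(frames - background)
    peak = deviation.max(axis=(1, 2), keepdims=True)
    return deviation / np.maximum(peak, 1e-8)     # normalize each frame to [0, 1]


def segment_from_attention(frame, attention, threshold=0.5):
    """Toy stand-in for the AGOS stage: a mask from one frame's attention map.

    A learned module would also use the appearance in `frame`; this
    placeholder ignores it and simply thresholds the attention map.
    """
    return attention >= threshold


def uvos(frames):
    """Chain the two stages: clip-level attention, then per-frame masks."""
    frames = np.asarray(frames, dtype=float)
    attention = predict_attention(frames)
    masks = [segment_from_attention(f, a) for f, a in zip(frames, attention)]
    return np.stack(masks)


# Usage: a single bright pixel moving across an otherwise empty clip.
clip = np.zeros((5, 4, 4))
for t in range(5):
    clip[t, 1, t % 4] = 1.0
masks = uvos(clip)
```

The point of the split is that each stage can be trained on a different, cheaper kind of supervision, which is the first merit the abstract lists; the placeholder bodies above could be swapped for any models that respect the same inputs and outputs.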

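| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Video Object Segmentation | DAVIS 2016 val | AGS | G | 78.6 | #21 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | AGS | J | 79.7 | #21 |
| Unsupervised Video Object Segmentation | DAVIS 2016 val | AGS | F | 77.4 | #20 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | J&F | 45.6 | #3 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | Jaccard (Mean) | 42.1 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | Jaccard (Recall) | 48.5 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | Jaccard (Decay) | 2.6 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | F-measure (Mean) | 49.0 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | F-measure (Recall) | 51.5 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (test-dev) | AGS | F-measure (Decay) | 2.6 | #2 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | AGS | J&F | 57.5 | #8 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | AGS | Jaccard (Mean) | 55.5 | #8 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | AGS | Jaccard (Recall) | 61.6 | #6 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | AGS | F-measure (Mean) | 59.5 | #8 |
| Unsupervised Video Object Segmentation | DAVIS 2017 (val) | AGS | F-measure (Recall) | 62.8 | #6 |
| Unsupervised Video Object Segmentation | YouTube-Objects | AGS | J | 69.7 | #8 |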
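For readers unfamiliar with the metric names reported above, the following is a minimal sketch of how DAVIS-style region-similarity statistics are commonly computed: J is the intersection-over-union of predicted and ground-truth masks, recall is the fraction of frames scoring above 0.5, decay is the gap between the first and last temporal quarter of a sequence, and J&F is the average of the J and F means. The function names are illustrative, the quartile split is a simplification of the official toolkit's binning, and the boundary measure F is not implemented here.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def summarize(per_frame_scores, threshold=0.5, n_bins=4):
    """Mean, recall and decay of one sequence's per-frame scores.

    Assumes at least `n_bins` frames so that no temporal bin is empty.
    """
    scores = np.asarray(per_frame_scores, dtype=float)
    mean = float(scores.mean())
    recall = float((scores > threshold).mean())       # share of frames above threshold
    bins = np.array_split(scores, n_bins)             # consecutive temporal chunks
    decay = float(bins[0].mean() - bins[-1].mean())   # early quality minus late quality
    return mean, recall, decay


def j_and_f(j_mean, f_mean):
    """Combined score: plain average of the J mean and the F mean."""
    return (j_mean + f_mean) / 2.0


# Usage: a sequence whose mask quality degrades steadily over eight frames.
iou = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])
mean, recall, decay = summarize([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
combined = j_and_f(55.5, 59.5)
```

Applied to the DAVIS 2017 (val) row, averaging the reported Jaccard mean (55.5) and F-measure mean (59.5) reproduces the listed J&F of 57.5; the other combined scores agree with this average to within rounding. A lower decay is better, since it means quality holds up over the length of the video.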

Methods