Occluded Video Instance Segmentation: A Benchmark

Can our video understanding systems perceive objects when heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories in which object occlusions frequently occur. While the human visual system can understand occluded instances through contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage of understanding objects, instances, and videos in real-world scenarios. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, it yields a notable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis .
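The idea behind the calibration module can be illustrated with a minimal sketch: for each location in the current (possibly occluded) frame, search a small spatial neighbourhood in a reference frame for the most similar feature and fuse it with the current one, so that cues hidden by occlusion can be borrowed from another time step. This is a simplified, hypothetical illustration in NumPy; the function name `calibrate_features`, the brute-force neighbourhood search, and the 0.5 fusion weight are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def calibrate_features(cur, ref, radius=1):
    """Hypothetical temporal feature calibration sketch.

    cur, ref: feature maps of shape (C, H, W) from the current and a
    reference frame. For each spatial location in `cur`, search a
    (2*radius+1)^2 neighbourhood in `ref`, pick the feature with the
    highest dot-product similarity, and average it with the current
    feature to compensate for cues lost to occlusion.
    """
    C, H, W = cur.shape
    out = cur.copy()
    for y in range(H):
        for x in range(W):
            best, best_sim = None, -np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        sim = float(cur[:, y, x] @ ref[:, yy, xx])
                        if sim > best_sim:
                            best_sim, best = sim, ref[:, yy, xx]
            # Simple fixed-weight fusion (an assumption; a learned
            # weighting would be used in a trainable module).
            out[:, y, x] = 0.5 * (cur[:, y, x] + best)
    return out
```

In a real network this search-and-fuse step would be differentiable (e.g. correlation followed by learned aggregation) and applied to backbone feature maps rather than raw arrays; the sketch only conveys the borrow-from-another-frame intuition.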


Datasets


Introduced in the Paper:

OVIS

Used in the Paper:

MS COCO, YouTube-VIS 2019, YouTube-VIS 2021
Task: Video Instance Segmentation. Global rank on the benchmark leaderboard is shown in parentheses. APso, APmo, and APho denote AP on slightly, moderately, and heavily occluded objects, respectively.

OVIS validation

Model                        | mask AP    | AP50       | AP75       | APso      | APmo      | APho
CMaskTrack R-CNN (ResNet-50) | 15.4 (#39) | 33.9 (#36) | 13.1 (#39) | 28.6 (#5) | 18.7 (#7) | 4.1 (#8)
CSipMask (ResNet-50)         | 14.3 (#42) | 29.9 (#40) | 12.5 (#40) | 23.0 (#6) | 12.8 (#9) | 2.7 (#9)

YouTube-VIS validation

Model             | mask AP    | AP50       | AP75
CMaskTrack R-CNN  | 32.1 (#47) | 52.8 (#43) | 34.9 (#43)
CSipMask          | 35.1 (#42) | 55.6 (#39) | 38.1 (#37)

Methods


No methods listed for this paper.