MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos

CVPR 2023 · Minghan Li, Shuai Li, Wangmeng Xiang, Lei Zhang

While impressive progress has been achieved, video instance segmentation (VIS) methods with per-clip input often fail on challenging videos with occluded objects and crowded scenes. This is mainly because the instance queries in these methods cannot encode discriminative instance embeddings well, making it difficult for the query-based segmenter to distinguish those 'hard' instances. To address these issues, we propose to mine discriminative query embeddings (MDQE) to segment occluded instances in challenging videos. First, we initialize the positional embeddings and content features of object queries by considering their spatial contextual information and the inter-frame object motion. Second, we propose an inter-instance mask repulsion loss that pushes each instance away from its nearby non-target instances. The proposed MDQE is the first per-clip VIS method to achieve state-of-the-art results on challenging videos and competitive performance on simple videos. Specifically, MDQE with ResNet50 achieves 33.0% and 44.5% mask AP on OVIS and YouTube-VIS 2021, respectively. Code is available at https://github.com/MinghanLi/MDQE_CVPR2023.
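To make the repulsion idea concrete, here is a minimal NumPy sketch of an inter-instance mask repulsion term. It is an illustration under our own assumptions, not the paper's exact formulation: it simply penalizes the fraction of an instance's predicted soft-mask mass that falls on the union of nearby non-target ground-truth masks, so minimizing it pushes the prediction off its neighbours.

```python
import numpy as np

def mask_repulsion_loss(pred_mask, nontarget_masks, eps=1e-6):
    """Hypothetical repulsion term (illustrative, not the paper's exact loss).

    pred_mask:       (H, W) float array of predicted probabilities in [0, 1]
    nontarget_masks: (N, H, W) binary masks of nearby non-target instances
    Returns the fraction of predicted mask mass overlapping non-target regions.
    """
    # Union of all non-target instance regions.
    union = np.clip(nontarget_masks.sum(axis=0), 0.0, 1.0)
    # Predicted mass that leaks onto non-target instances, normalized.
    overlap = (pred_mask * union).sum()
    return overlap / (pred_mask.sum() + eps)

# Toy example: a 4x4 prediction whose right half covers a neighbour's mask.
pred = np.zeros((4, 4)); pred[:, :2] = 1.0            # predicted instance on the left
other = np.zeros((1, 4, 4)); other[0, :, 1:3] = 1.0   # neighbour occupies middle columns
loss = mask_repulsion_loss(pred, other)
print(round(loss, 6))  # 0.5 — half the predicted mass overlaps the neighbour
```

Driving this term toward zero discourages a query's mask from bleeding into adjacent instances, which is the intuition behind separating 'hard', mutually occluding objects.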

Benchmark results (task: Video Instance Segmentation; "# n" = global rank on the leaderboard):

Dataset                 Model           Metric    Value   Global Rank
----------------------  --------------  --------  ------  -----------
OVIS validation         MDQE (Swin-L)   mask AP   42.6    # 13
OVIS validation         MDQE (Swin-L)   AP50      67.8    # 13
OVIS validation         MDQE (Swin-L)   AP75      44.3    # 13
OVIS validation         MDQE (Swin-L)   AR1       18.3    # 10
OVIS validation         MDQE (Swin-L)   AR10      46.5    # 13
OVIS validation         MDQE (Swin-L)   APso      65.1    # 3
OVIS validation         MDQE (Swin-L)   APmo      49.3    # 4
OVIS validation         MDQE (Swin-L)   APho      21.6    # 4
YouTube-VIS 2021        MDQE (Swin-L)   mask AP   55.5    # 13
YouTube-VIS 2021        MDQE (Swin-L)   AP50      80.7    # 10
YouTube-VIS 2021        MDQE (Swin-L)   AP75      61.7    # 13
YouTube-VIS 2021        MDQE (Swin-L)   AR10      60.6    # 14
YouTube-VIS 2021        MDQE (Swin-L)   AR1       45.4    # 13
YouTube-VIS validation  MDQE (Swin-L)   mask AP   59.9    # 16
YouTube-VIS validation  MDQE (Swin-L)   AP50      84.9    # 10
YouTube-VIS validation  MDQE (Swin-L)   AP75      67.3    # 12
YouTube-VIS validation  MDQE (Swin-L)   AR1       53.5    # 12
YouTube-VIS validation  MDQE (Swin-L)   AR10      65.0    # 12
