Spectrum-guided Multi-granularity Referring Video Object Segmentation

ICCV 2023 · Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian

Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation and which degrades their segmentation ability. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. The extended SgMg enables multi-object R-VOS and runs about 3 times faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
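
To make the idea of "intra-frame global interactions in the spectral domain" concrete, here is a minimal NumPy sketch. It is not the authors' SCF module. It only illustrates the general principle that multiplying a frame's 2D spectrum by a filter mixes information across all spatial positions at once. The function name `spectral_global_mix` and the `gate` argument are hypothetical, and how such a filter would be derived from language features is left open here.

```python
import numpy as np

def spectral_global_mix(features, gate):
    """Filter an (H, W, C) feature map in the frequency domain.

    `gate` has shape (H, W // 2 + 1, C) and scales each frequency per channel.
    In a language-conditioned model the gate would presumably be predicted
    from text features; that step is an assumption and is not modelled here.
    """
    height, width = features.shape[:2]
    # Real-input 2D FFT over the two spatial axes of a single frame.
    spectrum = np.fft.rfft2(features, axes=(0, 1))
    # An element-wise product in frequency space corresponds to a circular
    # convolution over the whole frame, so every pixel can affect every other.
    spectrum = spectrum * gate
    return np.fft.irfft2(spectrum, s=(height, width), axes=(0, 1))

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 4))

# An all-ones gate leaves the features unchanged.
identity_gate = np.ones((8, 5, 4))
unchanged = spectral_global_mix(feats, identity_gate)

# Keeping only the zero-frequency term replaces each pixel by the frame mean.
dc_gate = np.zeros((8, 5, 4))
dc_gate[0, 0, :] = 1.0
pooled = spectral_global_mix(feats, dc_gate)
```

The second case shows why a spectral filter counts as a global operation: a single retained coefficient already summarizes the entire frame.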


Results from the Paper


 Ranked #1 on Referring Expression Segmentation on J-HMDB (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Referring Expression Segmentation | A2D Sentences | SgMg (Video-Swin-B) | Precision@0.5 | 0.843 | #2 |
| | | | Precision@0.9 | 0.259 | #1 |
| | | | IoU overall | 0.799 | #2 |
| | | | IoU mean | 0.720 | #2 |
| | | | Precision@0.6 | 0.822 | #2 |
| | | | Precision@0.7 | 0.767 | #1 |
| | | | Precision@0.8 | 0.617 | #1 |
| | | | AP | 0.585 | #1 |
| Referring Expression Segmentation | DAVIS 2017 (val) | SgMg | J&F 1st frame | 63.3 | #3 |
| Referring Expression Segmentation | J-HMDB | SgMg (Video-Swin-B) | Precision@0.5 | 0.972 | #1 |
| | | | Precision@0.6 | 0.917 | #1 |
| | | | Precision@0.7 | 0.714 | #1 |
| | | | Precision@0.8 | 0.225 | #1 |
| | | | Precision@0.9 | 0.003 | #3 |
| | | | AP | 0.450 | #1 |
| | | | IoU overall | 0.737 | #1 |
| | | | IoU mean | 0.725 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | SgMg (Pre-training, Video-Swin-B) | J&F | 65.7 | #9 |
| | | | J | 63.9 | #8 |
| | | | F | 67.4 | #8 |
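
For readers unfamiliar with the Precision@K and IoU figures reported for A2D Sentences and J-HMDB, the sketch below shows how such metrics are commonly computed from binary masks. It follows the usual conventions: overall IoU is total intersection over total union, mean IoU is the average of per-sample IoUs, and Precision@K is the fraction of samples whose IoU exceeds K. It is not taken from the benchmark's official evaluation code. The function name `segmentation_metrics`, the strict `>` comparison, and the treatment of empty masks are assumptions. AP and J&F are not covered.

```python
import numpy as np

def segmentation_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute overall IoU, mean IoU and Precision@K for paired binary masks."""
    inters, unions = [], []
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inters.append(np.logical_and(pred, gt).sum())
        unions.append(np.logical_or(pred, gt).sum())
    inters = np.array(inters, dtype=float)
    unions = np.array(unions, dtype=float)

    # Assumption: a sample where both masks are empty gets IoU 0.
    ious = np.divide(inters, unions, out=np.zeros_like(inters), where=unions > 0)

    total_union = unions.sum()
    results = {
        # Pools pixels over all samples, so large objects weigh more.
        "IoU overall": inters.sum() / total_union if total_union > 0 else 0.0,
        # Every sample counts equally, regardless of object size.
        "IoU mean": ious.mean(),
    }
    for k in thresholds:
        # Assumption: strict comparison; some implementations may use >=.
        results[f"Precision@{k}"] = float((ious > k).mean())
    return results

# Two toy samples with IoU 1/2 and 3/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]
metrics = segmentation_metrics(preds, gts)
```

On this toy input the overall IoU is 4/6 and the mean IoU is 0.625, which illustrates why the two IoU columns can differ for the same model.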
