SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation
Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in annotation time, which creates a bottleneck. To address this, we propose a novel method, SynthRef, for generating synthetic referring expressions for target objects in an image (or video frame), and we release the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that training with our synthetic referring expressions improves a model's ability to generalize across different datasets, without any additional annotation cost. Moreover, our formulation can be applied to any object detection or segmentation dataset.
Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Segmentation | DAVIS 2017 (val) | RefVOS | J&F 1st frame | 45.1 | #10 |
| Referring Expression Segmentation | DAVIS 2017 (val) | RefVOS + SynthRef-YouTube-VIS | J&F 1st frame | 45.3 | #9 |
| Referring Expression Segmentation | DAVIS 2017 (val) | RefVOS + SynthRef-YouTube-VIS | J&F Full video | 44.8 | #4 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Human REs | Precision@0.5 | 38.6 | #2 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Human REs | Precision@0.9 | 6.9 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Human REs | Mean IoU | 39.5 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Synthetic REs | Precision@0.5 | 32.3 | #1 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Synthetic REs | Precision@0.9 | 1.8 | #2 |
| Referring Expression Segmentation | Refer-YouTube-VOS | RefVOS-Synthetic REs | Mean IoU | 35.0 | #2 |
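
The Precision@0.5, Precision@0.9 and Mean IoU entries above are presumably the usual IoU-based measures for referring expression segmentation: the fraction of samples whose predicted mask overlaps the ground truth beyond a threshold, and the average overlap over all samples. The following is a minimal sketch under that assumption, not the evaluation code used for the paper. The function names `mask_iou`, `precision_at` and `mean_iou` are hypothetical, and the strict `>` comparison and the handling of two empty masks are conventions chosen here for illustration.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def precision_at(ious, threshold):
    """Fraction of samples whose IoU is strictly above the threshold."""
    ious = np.asarray(ious, dtype=float)
    return float((ious > threshold).mean())


def mean_iou(ious):
    """Average of the per-sample IoU values."""
    return float(np.mean(np.asarray(ious, dtype=float)))


# Toy example with made-up masks (not data from the paper).
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
sample_ious = [mask_iou(pred, gt), 0.6, 0.95]
p50 = precision_at(sample_ious, 0.5)
p90 = precision_at(sample_ious, 0.9)
miou = mean_iou(sample_ious)
```

The table reports these quantities as percentages, so the values returned here would be multiplied by 100 for comparison.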