Referring Transformer: A One-step Approach to Multi-task Visual Grounding

NeurIPS 2021 · Muchen Li, Leonid Sigal

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
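To make the described architecture concrete, below is a minimal PyTorch sketch of such a one-stage design: visual and lingual tokens fused in a shared encoder, lingual queries contextualized in a decoder, and light heads that regress a box and score a mask. This is an illustrative assumption rather than the authors' implementation; the class and module names (ReferringTransformerSketch, mask_proj), the layer counts, the mean-pooled query, and the dot-product mask scoring are all placeholders for the paper's actual fusion, query decoding, and prediction heads.

```python
import torch
import torch.nn as nn


class ReferringTransformerSketch(nn.Module):
    """One-stage multi-task grounding sketch (illustrative, not the paper's exact model):
    visual and lingual tokens are fused in a shared encoder, lingual queries are
    contextualized in a decoder, and light heads regress a box and score a mask."""

    def __init__(self, d_model=256, nhead=8, num_enc=6, num_dec=6):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_enc)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_dec)
        # Box head regresses normalized (cx, cy, w, h) for the referred region.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4))
        # Projection used to score each visual token for the segmentation mask.
        self.mask_proj = nn.Linear(d_model, d_model)

    def forward(self, visual_tokens, lingual_tokens):
        # Fuse both modalities in a single visual-lingual encoder.
        fused = self.encoder(torch.cat([visual_tokens, lingual_tokens], dim=1))
        n_vis = visual_tokens.shape[1]
        vis_mem, lang_ctx = fused[:, :n_vis], fused[:, n_vis:]
        # Contextualized lingual queries attend over the visual memory.
        queries = self.decoder(lang_ctx, vis_mem)
        pooled = queries.mean(dim=1)                      # one referred region per expression
        box = self.box_head(pooled).sigmoid()             # (B, 4) normalized box
        # Per-token mask logits; reshape to the feature-map grid downstream.
        mask_logits = torch.einsum('bd,bnd->bn', self.mask_proj(pooled), vis_mem)
        return box, mask_logits


# Toy usage: 100 visual tokens (e.g., a 10x10 feature map) and 12 word tokens.
model = ReferringTransformerSketch()
box, mask_logits = model(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(box.shape, mask_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 100])
```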

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Referring Expression Comprehension | RefCOCO | RefTR-PT | Val | 85.65 | #11 |
| Referring Expression Comprehension | RefCOCO | RefTR-PT | Test A | 88.73 | #10 |
| Referring Expression Comprehension | RefCOCO | RefTR-PT | Test B | 81.16 | #10 |
| Referring Expression Comprehension | RefCOCO | RefTR | Val | 82.23 | #13 |
| Referring Expression Comprehension | RefCOCO | RefTR | Test A | 85.59 | #13 |
| Referring Expression Comprehension | RefCOCO | RefTR | Test B | 76.57 | #12 |
| Referring Expression Segmentation | RefCOCO testA | RefTR | Overall IoU | 73.49 | #12 |
| Referring Expression Segmentation | RefCOCO testB | RefTR | Overall IoU | 66.57 | #10 |
| Referring Expression Segmentation | RefCOCO val | RefTR | Overall IoU | 70.56 | #13 |
