Referring Expression Segmentation

68 papers with code • 25 benchmarks • 11 datasets

The task aims at labeling the pixels of an image or video that represent an object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object (the referent) in a discourse or scene.
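
As a concrete illustration of the task's input/output contract, here is a minimal toy sketch in PyTorch: an image and a tokenized expression go in, per-pixel mask logits come out. The architecture (embedding + LSTM text encoder, single-conv visual encoder, multiplicative fusion) is purely illustrative and not drawn from any of the papers below.

```python
# Minimal sketch of the referring-expression-segmentation interface
# (a hypothetical toy model, not any specific paper's architecture).
import torch
import torch.nn as nn

class ToyRefSegModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)         # expression encoder
        self.cnn = nn.Conv2d(3, dim, kernel_size=3, padding=1)  # visual encoder
        self.head = nn.Conv2d(dim, 1, kernel_size=1)            # per-pixel mask logits

    def forward(self, image, tokens):
        # image: (B, 3, H, W); tokens: (B, T) integer word ids
        _, (h, _) = self.lstm(self.embed(tokens))               # h: (1, B, dim)
        text = h[-1][:, :, None, None]                          # broadcast over pixels
        feat = torch.relu(self.cnn(image)) * text               # multiplicative fusion
        return self.head(feat)                                  # (B, 1, H, W) logits

model = ToyRefSegModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 8)))
mask = logits.sigmoid() > 0.5                                   # binary segmentation mask
```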

Most implemented papers

MAttNet: Modular Attention Network for Referring Expression Comprehension

lichengunc/MAttNet CVPR 2018

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
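
In simplified form, MAttNet decomposes the expression into subject, location, and relationship components and scores each candidate region with a language-weighted sum of per-module scores. The sketch below keeps that scoring structure but replaces the real modules with placeholder linear scorers; all dimensions are illustrative.

```python
# Simplified sketch of MAttNet-style modular scoring: language-predicted
# weights combine subject / location / relationship module scores per region.
# Placeholder linear layers stand in for the paper's actual modules.
import torch
import torch.nn as nn

class ModularScorer(nn.Module):
    def __init__(self, text_dim=128, region_dim=256):
        super().__init__()
        self.weight_head = nn.Linear(text_dim, 3)  # predicts [w_subj, w_loc, w_rel]
        self.subj = nn.Linear(region_dim + text_dim, 1)
        self.loc = nn.Linear(region_dim + text_dim, 1)
        self.rel = nn.Linear(region_dim + text_dim, 1)

    def forward(self, text, regions):
        # text: (B, text_dim) expression embedding; regions: (B, N, region_dim)
        w = torch.softmax(self.weight_head(text), dim=-1)            # (B, 3)
        t = text[:, None, :].expand(-1, regions.size(1), -1)         # tile over regions
        x = torch.cat([regions, t], dim=-1)
        scores = torch.cat([self.subj(x), self.loc(x), self.rel(x)], dim=-1)  # (B, N, 3)
        return (scores * w[:, None, :]).sum(-1)                      # (B, N) overall score

scorer = ModularScorer()
best = scorer(torch.randn(2, 128), torch.randn(2, 5, 256)).argmax(-1)  # pick referent
```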

Actor and Action Video Segmentation from a Sentence

JerryX1110/awesome-rvos CVPR 2018

This paper strives for pixel-level segmentation of actors and their actions in video content.

Referring Image Segmentation via Recurrent Refinement Networks

liruiyu/referseg_rrn CVPR 2018

We address the problem of image segmentation from natural language descriptions.
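
The recurrent refinement idea can be sketched generically as a ConvLSTM that consumes feature maps one step at a time and decodes the final hidden state into a refined mask; the configuration below (a single cell over three equally sized levels) is an assumption for illustration, not the paper's exact setup.

```python
# Generic sketch of recurrent refinement: a ConvLSTM cell consumes pyramid
# feature maps step by step; the final hidden state is decoded into a mask.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gates = nn.Conv2d(2 * dim, 4 * dim, kernel_size=3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

dim, B, H, W = 32, 2, 32, 32
cell, head = ConvLSTMCell(dim), nn.Conv2d(dim, 1, 1)
pyramid = [torch.randn(B, dim, H, W) for _ in range(3)]  # features, coarse to fine
h = c = torch.zeros(B, dim, H, W)
for feat in pyramid:                 # one refinement step per pyramid level
    h, c = cell(feat, h, c)
mask_logits = head(h)                # (B, 1, H, W)
```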

Cross-Modal Self-Attention Network for Referring Image Segmentation

lwye/CMSA-Net CVPR 2019

A gated multi-level fusion module controls the information flow of features at different levels.
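
A generic cross-modal self-attention block in this spirit treats spatial positions and words as one joint token sequence and lets every token attend to every other. The sketch below uses PyTorch's stock nn.MultiheadAttention for brevity rather than the paper's exact module.

```python
# Generic cross-modal self-attention sketch: visual positions and word
# features form one joint sequence, and standard multi-head self-attention
# mixes information across both modalities. Shapes are illustrative.
import torch
import torch.nn as nn

B, dim, H, W, T = 2, 64, 16, 16, 8
visual = torch.randn(B, dim, H, W)            # visual feature map
words = torch.randn(B, T, dim)                # per-word language features

tokens = visual.flatten(2).transpose(1, 2)    # (B, H*W, dim) spatial tokens
joint = torch.cat([tokens, words], dim=1)     # (B, H*W + T, dim) joint sequence

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
fused, _ = attn(joint, joint, joint)          # self-attention across both modalities
fused_visual = fused[:, :H * W].transpose(1, 2).reshape(B, dim, H, W)
```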

Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query

haowang1992/ACGA ICCV 2019

To address these issues, we propose an asymmetric cross-guided attention network for actor and action video segmentation from a natural language query.

Referring Expression Object Segmentation with Caption-Aware Consistency

wenz116/lang2seg 10 Oct 2019

To this end, we propose an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains.

Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

luogen1996/MCN CVPR 2020

In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
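
One plausible reading of the suppression idea, sketched below under that assumption: segmentation scores are softly rescaled with the box predicted by the comprehension branch, damping rather than zeroing responses outside it. The scale factors alpha and beta are illustrative, not the paper's values.

```python
# Illustrative sketch of soft non-located suppression: mask scores outside the
# comprehension branch's predicted box are damped by alpha, scores inside are
# scaled by beta. Alpha/beta are assumed values for illustration.
import torch

def soft_suppress(mask_scores, box, alpha=0.5, beta=1.0):
    # mask_scores: (B, 1, H, W) sigmoid scores; box: (B, 4) as (x1, y1, x2, y2)
    B, _, H, W = mask_scores.shape
    ys = torch.arange(H).view(1, H, 1)
    xs = torch.arange(W).view(1, 1, W)
    inside = ((xs >= box[:, 0].view(B, 1, 1)) & (xs < box[:, 2].view(B, 1, 1)) &
              (ys >= box[:, 1].view(B, 1, 1)) & (ys < box[:, 3].view(B, 1, 1)))
    scale = alpha + (beta - alpha) * inside.float()    # beta inside, alpha outside
    return mask_scores * scale.unsqueeze(1)

scores = torch.rand(2, 1, 32, 32)                      # raw mask probabilities
boxes = torch.tensor([[4.0, 4.0, 20.0, 20.0], [8.0, 8.0, 28.0, 24.0]])
suppressed = soft_suppress(scores, boxes)
```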

Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

ilkerkesen/bvpr 28 Mar 2020

Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance.
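
The core mechanism, generating convolution filters from a sentence embedding and applying them to the visual stream, can be sketched generically with 1x1 dynamic filters; the shapes and the grouped-convolution trick below are illustrative choices, not the paper's exact design.

```python
# Generic sketch of language-conditional filtering: a 1x1 convolution kernel
# is generated from the sentence embedding and applied to the visual feature
# map, so language modulates bottom-up visual processing.
import torch
import torch.nn.functional as F

B, text_dim, vis_dim, out_dim, H, W = 2, 128, 64, 32, 16, 16
sentence = torch.randn(B, text_dim)                  # sentence embedding
visual = torch.randn(B, vis_dim, H, W)               # visual feature map

filter_gen = torch.nn.Linear(text_dim, out_dim * vis_dim)
kernels = filter_gen(sentence).view(B * out_dim, vis_dim, 1, 1)

# Grouped convolution applies each example's own language-generated kernels.
out = F.conv2d(visual.view(1, B * vis_dim, H, W), kernels, groups=B)
out = out.view(B, out_dim, H, W)                     # language-modulated features
```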

PhraseCut: Language-based Image Segmentation in the Wild

ChenyunWu/PhraseCutDataset CVPR 2020

We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs.

Referring Image Segmentation via Cross-Modal Progressive Comprehension

spyflying/CMPC-Refseg CVPR 2020

In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information.
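
A generic sketch of text-guided multi-level fusion: a gate per feature level is predicted from the sentence embedding and weights that level's contribution. This is a simplification, not the exact TGFE module.

```python
# Generic sketch of text-guided multi-level feature fusion: per-level gates
# predicted from the sentence embedding weight each level's features.
import torch
import torch.nn as nn

B, dim, H, W, levels = 2, 64, 16, 16, 3
text = torch.randn(B, 128)                                     # sentence embedding
features = [torch.randn(B, dim, H, W) for _ in range(levels)]  # per-level features

gate_head = nn.Linear(128, levels)
gates = torch.softmax(gate_head(text), dim=-1)                 # (B, levels), sums to 1

fused = sum(g.view(B, 1, 1, 1) * f
            for g, f in zip(gates.unbind(dim=-1), features))   # (B, dim, H, W)
```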