Referring Expression Segmentation
68 papers with code • 25 benchmarks • 11 datasets
The task aims to label the pixels of an image or video that belong to an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must identify a single object in a discourse or scene (the referent); that is, REs unambiguously identify the target instance.
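Methods for this task are typically evaluated by the intersection-over-union (IoU) between the predicted and ground-truth segmentation masks. A minimal sketch of that metric (the toy masks and the `mask_iou` helper below are illustrative, not taken from any listed paper):

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks.

    pred, gt: 2D lists (H x W) of 0/1 values.
    """
    inter = sum(p & g for row_p, row_g in zip(pred, gt)
                for p, g in zip(row_p, row_g))
    union = sum(p | g for row_p, row_g in zip(pred, gt)
                for p, g in zip(row_p, row_g))
    # Two empty masks are a perfect match by convention.
    return inter / union if union else 1.0

# Toy 4x4 example: the predicted mask only partially covers the referent.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
gt   = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(mask_iou(pred, gt))  # 2 shared pixels / 6 in the union -> 0.333...
```

Benchmarks on the datasets below usually report mean IoU over all expressions, sometimes alongside precision at fixed IoU thresholds.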
Most implemented papers
MAttNet: Modular Attention Network for Referring Expression Comprehension
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression.
Actor and Action Video Segmentation from a Sentence
This paper strives for pixel-level segmentation of actors and their actions in video content.
Referring Image Segmentation via Recurrent Refinement Networks
We address the problem of image segmentation from natural language descriptions.
Cross-Modal Self-Attention Network for Referring Image Segmentation
The paper proposes a gated multi-level fusion module that controls the information flow of features at different levels.
Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query
To address these issues, we propose an asymmetric cross-guided attention network for actor and action video segmentation from natural language query.
Referring Expression Object Segmentation with Caption-Aware Consistency
To this end, we propose an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains.
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance.
PhraseCut: Language-based Image Segmentation in the Wild
We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs.
Referring Image Segmentation via Cross-Modal Progressive Comprehension
In addition to the Cross-Modal Progressive Comprehension (CMPC) module, we further leverage a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features from different levels with the guidance of textual information.