Referring Expression Segmentation

68 papers with code • 25 benchmarks • 11 datasets

The task aims to label the pixels of an image or video that belong to an object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object (the referent) in a discourse or scene.
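Concretely, a referring expression segmentation model maps an image and a free-form expression to a per-pixel binary mask. The sketch below illustrates only this input/output contract; `segment_referent` and its trivial brightness thresholding are hypothetical placeholders, not any paper's method:

```python
import numpy as np

def segment_referent(image: np.ndarray, expression: str) -> np.ndarray:
    """Hypothetical RES interface: return a boolean mask, one value per
    pixel, that is True where the referent described by `expression`
    appears. A real model would fuse visual and linguistic features;
    this placeholder just marks bright pixels so the I/O contract runs."""
    gray = image.mean(axis=-1)   # (H, W) luminance
    return gray > gray.mean()    # (H, W) boolean mask

# Toy RGB image: a bright square (the "referent") on a dark background.
img = np.zeros((32, 32, 3), dtype=np.float32)
img[8:16, 8:16] = 1.0
mask = segment_referent(img, "the bright square")
print(mask.shape, int(mask.sum()))  # one mask value per pixel
```

The key point is the output granularity: unlike referring expression comprehension, which predicts a box, RES predicts a mask with one decision per pixel.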

Latest papers with no code

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

no code yet • 26 Feb 2024

Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling, where grounded objects are captured by bounding boxes expressed as sequences of location tokens.
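The location-token scheme mentioned here can be sketched as coordinate quantization: each box coordinate is binned and emitted as a discrete token the language model can generate. The token naming (`<loc_k>`) and bin count below are illustrative assumptions, not the format of any specific model:

```python
def box_to_loc_tokens(box, image_size, num_bins=1000):
    """Quantize a bounding box (x0, y0, x1, y1) in pixels into discrete
    location tokens so a causal language model can emit the box as text.
    Token names and num_bins are illustrative, not a real model's vocab."""
    w, h = image_size
    tokens = []
    for value, extent in zip(box, (w, h, w, h)):
        # map [0, extent) to an integer bin in [0, num_bins)
        k = min(int(value / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{k}>")
    return tokens

tokens = box_to_loc_tokens((32, 64, 96, 128), image_size=(256, 256))
print(tokens)  # four location tokens, one per box coordinate
```

Because a box reduces to four tokens, it fits naturally into next-token prediction; GROUNDHOG's point is that this representation is coarser than the dense masks used in holistic segmentation.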

RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

no code yet • 8 Feb 2024

This pioneering work lays the groundwork for future research in semi-supervised learning for referring expression segmentation.

Generalizable Entity Grounding via Assistance of Large Language Model

no code yet • 4 Feb 2024

In this work, we propose a novel approach to densely ground visual entities from a long caption.

Mask Grounding for Referring Image Segmentation

no code yet • 19 Dec 2023

To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects.

GSVA: Generalized Segmentation via Multimodal Large Language Models

no code yet • 15 Dec 2023

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to expressions that refer to multiple objects or to empty targets absent from the image.

Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

no code yet • 8 Dec 2023

During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects.

CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

no code yet • 17 Sep 2023

To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network.

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

no code yet • 18 Aug 2023

In previous approaches, fused vision-language features are fed directly into a decoder and passed through a convolution with a fixed kernel to obtain the result, a pattern similar to traditional image segmentation.
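The fixed-kernel pattern the authors contrast against can be sketched as a 1×1 convolution over the fused features, which is equivalent to a per-pixel dot product with a single learned weight vector. Shapes and variable names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 16, 8, 8
fused = rng.standard_normal((C, H, W))   # fused vision-language features
kernel = rng.standard_normal(C)          # fixed 1x1 conv kernel, one output channel
bias = 0.0

# A 1x1 convolution is a per-pixel dot product along the channel axis.
logits = np.einsum("c,chw->hw", kernel, fused) + bias
mask = logits > 0                         # threshold into a binary mask
print(logits.shape, mask.dtype)
```

Because the kernel is the same for every input, the language only influences the prediction through the fused features; EAVL's critique is that this leaves vision and language implicitly, rather than explicitly, aligned.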

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

no code yet • 19 Jun 2023

Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information.

Meta Compositional Referring Expression Segmentation

no code yet • CVPR 2023

A novel meta-optimization scheme then trains the model on the virtual training set while optimizing for good performance on the virtual testing sets. This drives the model to better capture the semantics and visual representations of individual concepts, yielding robust generalization even when handling novel compositions.