Referring Expression Segmentation
68 papers with code • 25 benchmarks • 11 datasets
The task aims to label the pixels of an image or video that belong to an object instance referred to by a linguistic expression. In particular, the referring expression (RE) must single out an individual object in a discourse or scene (the referent); that is, REs unambiguously identify the target instance.
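To make the input/output contract concrete, below is a minimal, purely illustrative sketch (not any published model from the papers listed here): a toy image encoder and a toy text encoder are fused, and the model predicts a per-pixel probability of belonging to the referred object. All module names and dimensions are hypothetical.

```python
# Illustrative sketch of the referring-expression-segmentation interface:
# inputs are an image and a tokenized referring expression, output is a
# per-pixel mask for the referred object. Toy encoders only.
import torch
import torch.nn as nn

class ToyRefSegModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        # Tiny convolutional image encoder (downsamples spatial resolution by 4).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Tiny text encoder: embed tokens, then mean-pool into one vector.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        # Fuse language with every spatial location and predict a mask logit.
        self.mask_head = nn.Conv2d(2 * dim, 1, 1)

    def forward(self, image, tokens):
        feat = self.image_encoder(image)                 # (B, D, H/4, W/4)
        text = self.text_encoder(tokens).mean(dim=1)     # (B, D)
        text = text[:, :, None, None].expand(-1, -1, *feat.shape[2:])
        logits = self.mask_head(torch.cat([feat, text], dim=1))
        # Upsample to input resolution; sigmoid gives per-pixel probability.
        logits = nn.functional.interpolate(logits, size=image.shape[2:],
                                           mode="bilinear", align_corners=False)
        return logits.sigmoid().squeeze(1)               # (B, H, W)

model = ToyRefSegModel()
image = torch.rand(1, 3, 128, 128)          # RGB image
tokens = torch.randint(0, 1000, (1, 6))     # e.g. "the dog on the left"
mask = model(image, tokens)                 # (1, 128, 128), values in [0, 1]
print(mask.shape)
```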
Latest papers
GLaMM: Pixel Grounding Large Multimodal Model
In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
Towards Omni-supervised Referring Expression Segmentation
To address this issue, we propose a new learning task for RES called Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to make full use of unlabeled, fully labeled, and weakly labeled data (e.g., referring points or grounding boxes) for efficient RES training.
Tracking Anything with Decoupled Video Segmentation
To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation
In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions.
Referring Image Segmentation Using Text Supervision
Hence, we propose a novel weakly-supervised RIS framework to formulate the target localization problem as a classification process to differentiate between positive and negative text expressions.
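As a rough, generic illustration of text-only supervision of this flavor (a hedged sketch of the positive-vs-negative expression idea, not the framework proposed in the paper above), one can score a pooled image feature against the positive expression and several negative expressions and apply a standard classification loss so the positive expression wins.

```python
# Illustrative only: compare a pooled image feature to one positive and
# several negative expression embeddings; cross-entropy pushes the positive
# expression to score highest. All shapes and encoders are hypothetical.
import torch
import torch.nn.functional as F

def text_classification_loss(image_feat, pos_text, neg_texts, temperature=0.07):
    """image_feat: (D,), pos_text: (D,), neg_texts: (N, D)."""
    texts = torch.cat([pos_text[None], neg_texts], dim=0)        # (1 + N, D)
    sims = F.cosine_similarity(image_feat[None], texts, dim=-1)  # (1 + N,)
    logits = sims / temperature
    target = torch.tensor(0)          # index 0 is the positive expression
    return F.cross_entropy(logits[None], target[None])

loss = text_classification_loss(torch.randn(64), torch.randn(64), torch.randn(8, 64))
print(loss.item())
```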
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images.
EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
Next, we propose an Expression Alignment (EA) mechanism for audio and text expressions.
Spectrum-guided Multi-granularity Referring Video Object Segmentation
To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks.
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
Parameter-Efficient Tuning (PET) has gained attention for reducing the number of trainable parameters while maintaining performance and saving hardware resources, but few studies have investigated dense prediction tasks or the interaction between modalities.
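As a minimal sketch of the general PET idea (freeze the pretrained encoder and train only small injected modules; this is a generic residual-adapter example under assumed sizes, not the specific vision-language bridging scheme of the paper above):

```python
# Generic parameter-efficient tuning sketch: freeze a pretrained encoder and
# train only a small residual bottleneck adapter. Dimensions are hypothetical.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in encoder.parameters():
    p.requires_grad = False          # pretrained weights stay fixed

class Adapter(nn.Module):
    def __init__(self, dim=512, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual adapter

adapter = Adapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)  # only adapter trains

features = torch.randn(4, 512)
out = adapter(encoder(features))
n_trainable = sum(p.numel() for p in adapter.parameters())
n_frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable: {n_trainable}, frozen: {n_frozen}")
```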
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction.