Referring Expression Segmentation
66 papers with code • 25 benchmarks • 11 datasets
The task aims at labeling the pixels of an image or video that represent the object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object in a discourse or scene (the referent).
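A minimal sketch of the task's input/output contract: an image and a referring expression go in, a binary per-pixel mask of the referent comes out. The `segment` function and its placeholder body below are hypothetical and stand in for any of the models listed on this page:

```python
import numpy as np

def segment(image: np.ndarray, expression: str) -> np.ndarray:
    """Return a boolean mask of shape (H, W) marking the referent's pixels.

    Placeholder: a real model would jointly encode `image` and `expression`
    and decode a per-pixel foreground prediction.
    """
    h, w = image.shape[:2]
    return np.zeros((h, w), dtype=bool)  # dummy output: no pixel selected

# Usage: one image, one unambiguous referring expression, one mask out.
image = np.zeros((480, 640, 3), dtype=np.uint8)       # H x W x RGB
mask = segment(image, "the woman in the red jacket")  # H x W, True = referent
assert mask.shape == image.shape[:2]
```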
Latest papers
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks.
UniVS: Unified and Universal Video Segmentation with Prompts as Queries
Despite the recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge.
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
We evaluate our unified models on various benchmarks.
General Object Foundation Model for Images and Videos at Scale
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.
EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
We propose a novel image-text alignment module to improve feature extraction from the Stable Diffusion backbone.
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the UniRES model will be publicly available at https://github.com/Rubics-Xuan/MRES.
Universal Segmentation at Arbitrary Granularity with Language Instruction
This paper aims to achieve universal segmentation at arbitrary semantic levels.
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.
NExT-Chat: An LMM for Chat, Detection and Segmentation
The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs).
GLaMM: Pixel Grounding Large Multimodal Model
In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.