Referring Expression Comprehension

67 papers with code • 8 benchmarks • 8 datasets

Referring Expression Comprehension (REC) is the task of localizing the object in an image that a natural-language expression refers to, typically by predicting a bounding box. Unlike closed-set object detection, the target category is not fixed in advance: the model must ground a free-form phrase such as "the dog on the left" to one specific region.
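REC is conventionally scored as accuracy at an IoU threshold: a prediction counts as correct if its box overlaps the annotated box with intersection-over-union of at least 0.5. A minimal sketch of that metric; the (x1, y1, x2, y2) box format and the 0.5 threshold are the usual conventions, not tied to any single benchmark:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of referring expressions whose predicted box matches
    the annotated box with IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```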


Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

Compositional Attention Networks for Machine Reasoning

stanfordnlp/mac-network ICLR 2018

We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
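A conceptual sketch of that distinction, not UNITER's actual implementation: only one modality is corrupted per step, so the other is fully observed. Tensor shapes, the masking probability, and the mask token id are illustrative assumptions.

```python
import torch

def conditional_mask(text_ids, region_feats, mask_text, p=0.15, mask_token_id=103):
    """Conditional masking: corrupt ONE modality per pre-training step,
    unlike joint random masking, which would corrupt both at once."""
    text_ids = text_ids.clone()          # (batch, seq_len) token ids
    region_feats = region_feats.clone()  # (batch, regions, dim) features
    if mask_text:
        # masked language modeling, conditioned on all image regions
        mask = torch.rand(text_ids.shape) < p
        text_ids[mask] = mask_token_id
    else:
        # masked region modeling, conditioned on the full sentence
        mask = torch.rand(region_feats.shape[:2]) < p
        region_feats[mask] = 0.0
    return text_ids, region_feats
```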

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

idea-research/groundingdino 9 Mar 2023

To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection module, and a cross-modality decoder.
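The repo exposes a small inference API, so referring expressions can be run directly as open-set detection prompts. A usage sketch based on the repo's README; the config, checkpoint, and image paths are placeholders, and the thresholds are the README's defaults:

```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# placeholder paths; the config ships with the repo, the checkpoint is a separate download
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth")
image_source, image = load_image("example.jpg")

# open-set detection: any free-form phrase can serve as the "label set"
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="the dog on the left",
    box_threshold=0.35,
    text_threshold=0.25,
)
annotated = annotate(image_source=image_source, boxes=boxes,
                     logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```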

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
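Concretely, OFA casts REC as ordinary sequence-to-sequence generation: the target box is emitted as discrete location tokens rather than by a detection head. A rough sketch of that target encoding; the `<bin_i>` spelling, the bin count, and the instruction wording are assumptions for illustration, not copied from the codebase:

```python
def box_to_location_tokens(box, img_w, img_h, num_bins=1000):
    """Quantize a pixel-space box (x1, y1, x2, y2) into discrete location
    tokens so a seq2seq decoder can emit it like ordinary text."""
    x1, y1, x2, y2 = box
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return " ".join(f"<bin_{b}>" for b in bins)

# an REC instance becomes a plain (instruction, target) text pair
instruction = 'which region does the text "the dog on the left" describe?'
target = box_to_location_tokens((48, 240, 195, 371), img_w=640, img_h=480)
```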

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

ruotianluo/iep-ref CVPR 2019

Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

jackroos/VL-BERT ICLR 2020

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

SeqTR: A Simple yet Universal Network for Visual Grounding

sean-zhuh/seqtr 30 Mar 2022

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
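The unifying idea is that every grounding output is just a sequence of (x, y) point tokens: two box corners for REC, points sampled along the mask contour for RES, so one network and one loss cover both. A conceptual sketch of that shared target format; the quantization grid and point counts are illustrative, not SeqTR's exact scheme:

```python
def points_to_sequence(points, img_w, img_h, num_bins=1000):
    """Serialize (x, y) points into one flat token sequence; REC and RES
    differ only in how many points the target contains."""
    seq = []
    for x, y in points:
        seq.append(min(int(x / img_w * num_bins), num_bins - 1))
        seq.append(min(int(y / img_h * num_bins), num_bins - 1))
    return seq

rec_target = points_to_sequence([(48, 240), (195, 371)], 640, 480)  # box corners
res_target = points_to_sequence(
    [(60, 250), (90, 230), (180, 300)], 640, 480)  # sampled mask contour
```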

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

microsoft/SoM 17 Oct 2023

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
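The prompting step itself is simple to picture: segment the image, draw a numbered mark on each region, and ask the LMM to answer in terms of mark numbers, so its answers are trivially grounded. A minimal sketch of the overlay with PIL; the masks are assumed to come from any off-the-shelf segmenter, and the mark styling is arbitrary:

```python
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image_path, masks, out_path="som_prompt.jpg"):
    """Draw a numeric mark at the center of each segmented region.
    `masks` is a list of boolean numpy arrays from any segmenter (e.g. SAM);
    the marked image is then sent to the LMM alongside the question."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill="white")
        draw.text((cx - 4, cy - 7), str(i), fill="black")
    img.save(out_path)
    return out_path
```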