Visual Grounding
178 papers with code • 3 benchmarks • 5 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image given a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):
- What is the main focus of the query?
- How should the model understand the image?
- How should it localize the target object?
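To make the task concrete, here is a minimal sketch of phrase grounding at inference time. It uses OWL-ViT from Hugging Face transformers as one off-the-shelf open-vocabulary model; the image path and query are placeholders, and this is not the code of any paper listed below.

```python
# Minimal phrase-grounding sketch; "scene.jpg" and the query are placeholders.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")   # placeholder image
query = "the dog on the left"                    # natural language query

inputs = processor(text=[[query]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to the original image size (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.0, target_sizes=target_sizes
)[0]

# Ground the query to the single highest-scoring box.
best = results["scores"].argmax()
print("box (x0, y0, x1, y1):", results["boxes"][best].tolist())
print("confidence:", results["scores"][best].item())
```

The model scores every predicted box against the query, so taking the argmax over scores grounds the phrase to a single region; the papers on this page differ mainly in how that image-text alignment is learned.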
Most implemented papers
Collaborative Transformers for Grounded Situation Recognition
To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation.
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC), and segmentation (RES).
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
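The core mechanic is simple enough to sketch: partition the image into regions (the paper uses off-the-shelf segmentation models such as SAM), draw a numbered mark on each region, and let the LMM answer in terms of mark indices. A hypothetical overlay step with PIL, assuming region boxes are already available; this is a sketch of the idea, not SoM's released code:

```python
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw a numbered mark at the center of each candidate region so an
    LMM can refer to regions by their printed index."""
    draw = ImageDraw.Draw(image)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12],
                     fill="white", outline="black")
        draw.text((cx - 4, cy - 7), str(i), fill="black")
    return image

# Placeholder boxes; in practice they would come from a segmenter.
img = Image.new("RGB", (320, 240), "gray")
marked = overlay_marks(img, [(30, 40, 120, 160), (180, 60, 300, 200)])
marked.save("marked.png")
```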
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
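As a generic illustration only (not the paper's implementation), attending a question embedding over dialog-history embeddings can be sketched as scaled dot-product attention; all shapes and names below are assumptions:

```python
import torch
import torch.nn.functional as F

def question_history_attention(question, history, d_model=256):
    """Sketch: a question vector (d_model,) attends over dialog-history
    vectors (num_rounds, d_model), returning an attended history context."""
    scores = history @ question / d_model ** 0.5   # (num_rounds,)
    weights = F.softmax(scores, dim=0)             # attention over rounds
    context = weights @ history                    # (d_model,) context vector
    return context, weights

# Toy usage with random embeddings.
q = torch.randn(256)
h = torch.randn(5, 256)
ctx, w = question_history_attention(q, h)
print(w.shape, ctx.shape)  # torch.Size([5]) torch.Size([256])
```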
A Fast and Accurate One-Stage Approach to Visual Grounding
We propose a simple, fast, and accurate one-stage approach to visual grounding.
Learning Cross-modal Context Graph for Visual Grounding
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Composing Pick-and-Place Tasks By Grounding Language
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
Grounding referring expressions in RGBD images is an emerging field.