Visual Grounding

181 papers with code • 3 benchmarks • 5 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a rough inference sketch follows this list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the referred object be located?
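
As a rough end-to-end illustration of the task interface (not the method of any paper listed below), the sketch below grounds a free-form phrase to a bounding box with an off-the-shelf open-vocabulary detector (OWL-ViT) from Hugging Face Transformers; the checkpoint, image URL, query, and score threshold are illustrative choices.

```python
# Minimal visual grounding sketch: ground one phrase to one box with OWL-ViT.
# Checkpoint, image, query, and threshold are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

checkpoint = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(checkpoint)
model = OwlViTForObjectDetection.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query = [["a cat lying on a couch"]]  # the natural-language phrase to ground

inputs = processor(text=query, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions to (score, box) pairs in original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# Keep the single highest-scoring box as the grounded region for the query.
best = results["scores"].argmax()
print("box (xmin, ymin, xmax, ymax):", results["boxes"][best].tolist())
print("score:", results["scores"][best].item())
```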

Most implemented papers

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

yanmin-wu/eda CVPR 2023

3D visual grounding aims to find, within a point cloud, the object mentioned by a free-form natural language description with rich semantic cues.

X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

zengyan-97/x2-vlm 22 Nov 2022

Vision-language pre-training aims to learn alignments between vision and language from a large amount of data.

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

linhuixiao/clip-vg 15 May 2023

To leverage vision-and-language pre-trained models for the grounding problem and make reasonable use of pseudo-labels, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels.

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way to build a general representation model toward unlimited modalities.

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

jasonppy/syllable-discovery 19 May 2023

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually grounded training objective.

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
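
A brief usage sketch, assuming the Hugging Face Transformers integration of Kosmos-2 (checkpoint name, grounding prompt, and entity format as exposed there; the image URL and decoding settings are illustrative):

```python
# Grounded generation with Kosmos-2: the model emits text plus location tokens,
# which the processor converts into (phrase, span, bounding boxes) entities.
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

checkpoint = "microsoft/kosmos-2-patch14-224"
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The "<grounding>" prefix asks the model to ground noun phrases to boxes.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Entities come back as (phrase, (start, end), [normalized boxes]) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```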

Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

yifeisu/tg-gat 22 Aug 2023

This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023.

InfMLLM: A Unified Framework for Visual-Language Tasks

infly-ai/inf-mllm 12 Nov 2023

Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding.

Aligning and Prompting Everything All at Once for Universal Visual Perception

shenyunhang/ape 4 Dec 2023

However, predominant paradigms, which cast instance-level tasks as object-word alignment, introduce heavy cross-modality interaction and are not effective for prompting object detection and visual grounding.