Visual Grounding

181 papers with code • 3 benchmarks • 5 datasets

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a rough inference sketch follows this list):

  • What is the main focus of the query?
  • How should the image be understood?
  • How can the referred object be located?
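
As a rough end-to-end illustration of the task interface (not the method of any paper listed below), the sketch below grounds a free-form phrase to a bounding box with an off-the-shelf open-vocabulary detector (OWL-ViT) from Hugging Face Transformers; the checkpoint, image URL, query, and score threshold are illustrative choices.

```python
# Minimal visual grounding sketch: ground one phrase to one box with OWL-ViT.
# Checkpoint, image, query, and threshold are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

checkpoint = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(checkpoint)
model = OwlViTForObjectDetection.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query = [["a cat lying on a couch"]]  # the natural-language phrase to ground

inputs = processor(text=query, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions to (score, box) pairs in original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# Keep the single highest-scoring box as the grounded region for the query.
best = results["scores"].argmax()
print("box (xmin, ymin, xmax, ymax):", results["boxes"][best].tolist())
print("score:", results["scores"][best].item())
```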

Most implemented papers

TransVG: End-to-End Visual Grounding with Transformers

djiajunustc/TransVG ICCV 2021

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

yanmin-wu/eda CVPR 2023

3D visual grounding aims to find, within a point cloud, the object mentioned by a free-form natural language description with rich semantic cues.

X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

zengyan-97/x2-vlm 22 Nov 2022

Vision-language pre-training aims to learn alignments between vision and language from a large amount of data.

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

linhuixiao/clip-vg 15 May 2023

To leverage vision-and-language pre-trained models for the grounding problem and make reasonable use of pseudo-labels, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels.

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way to build a general representation model toward unlimited modalities.

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

jasonppy/syllable-discovery 19 May 2023

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually grounded training objective.

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
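
A brief usage sketch, assuming the Hugging Face Transformers integration of Kosmos-2 (checkpoint name, grounding prompt, and entity format as exposed there; the image URL and decoding settings are illustrative):

```python
# Grounded generation with Kosmos-2: the model emits text plus location tokens,
# which the processor converts into (phrase, span, bounding boxes) entities.
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

checkpoint = "microsoft/kosmos-2-patch14-224"
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The "<grounding>" prefix asks the model to ground noun phrases to boxes.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Entities come back as (phrase, (start, end), [normalized boxes]) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```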

Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

yifeisu/tg-gat 22 Aug 2023

This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023.

InfMLLM: A Unified Framework for Visual-Language Tasks

infly-ai/inf-mllm 12 Nov 2023

Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding.

Aligning and Prompting Everything All at Once for Universal Visual Perception

shenyunhang/ape 4 Dec 2023

However, predominant paradigms, which cast instance-level tasks as object-word alignment, introduce heavy cross-modality interaction and are not effective for prompting object detection and visual grounding.