Visual Grounding
178 papers with code • 3 benchmarks • 5 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image given a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG (a minimal inference sketch follows the list):
- What is the main focus of the query?
- How should the model understand the image?
- How should it localize the target object?
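To make the task concrete, here is a minimal sketch of phrase grounding at inference time. It uses OWL-ViT from Hugging Face transformers as one off-the-shelf open-vocabulary model; the image path and query are placeholders, and this is not the code of any paper listed below.

```python
# Minimal phrase-grounding sketch; "scene.jpg" and the query are placeholders.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")   # placeholder image
query = "the dog on the left"                    # natural language query

inputs = processor(text=[[query]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to the original image size (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.0, target_sizes=target_sizes
)[0]

# Ground the query to the single highest-scoring box.
best = results["scores"].argmax()
print("box (x0, y0, x1, y1):", results["boxes"][best].tolist())
print("confidence:", results["scores"][best].item())
```

The model scores every predicted box against the query, so taking the argmax over scores grounds the phrase to a single region; the papers on this page differ mainly in how that image-text alignment is learned.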
Most implemented papers
Collaborative Transformers for Grounded Situation Recognition
To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation.
SeqTR: A Simple yet Universal Network for Visual Grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC), and segmentation (RES).
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
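The core mechanic is simple enough to sketch: partition the image into regions (the paper uses off-the-shelf segmentation models such as SAM), draw a numbered mark on each region, and let the LMM answer in terms of mark indices. A hypothetical overlay step with PIL, assuming region boxes are already available; this is a sketch of the idea, not SoM's released code:

```python
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw a numbered mark at the center of each candidate region so an
    LMM can refer to regions by their printed index."""
    draw = ImageDraw.Draw(image)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12],
                     fill="white", outline="black")
        draw.text((cx - 4, cy - 7), str(i), fill="black")
    return image

# Placeholder boxes; in practice they would come from a segmenter.
img = Image.new("RGB", (320, 240), "gray")
marked = overlay_marks(img, [(30, 40, 120, 160), (180, 60, 300, 200)])
marked.save("marked.png")
```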
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism.
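As a generic illustration only (not the paper's implementation), attending a question embedding over dialog-history embeddings can be sketched as scaled dot-product attention; all shapes and names below are assumptions:

```python
import torch
import torch.nn.functional as F

def question_history_attention(question, history, d_model=256):
    """Sketch: a question vector (d_model,) attends over dialog-history
    vectors (num_rounds, d_model), returning an attended history context."""
    scores = history @ question / d_model ** 0.5   # (num_rounds,)
    weights = F.softmax(scores, dim=0)             # attention over rounds
    context = weights @ history                    # (d_model,) context vector
    return context, weights

# Toy usage with random embeddings.
q = torch.randn(256)
h = torch.randn(5, 256)
ctx, w = question_history_attention(q, h)
print(w.shape, ctx.shape)  # torch.Size([5]) torch.Size([256])
```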
A Fast and Accurate One-Stage Approach to Visual Grounding
We propose a simple, fast, and accurate one-stage approach to visual grounding.
Learning Cross-modal Context Graph for Visual Grounding
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Composing Pick-and-Place Tasks By Grounding Language
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
Grounding referring expressions in RGBD images is an emerging field.