Phrase Grounding

36 papers with code • 5 benchmarks • 6 datasets

Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.

Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
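
Phrase grounding systems are commonly scored by whether the predicted region overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. A minimal sketch of that metric (function names are illustrative, not taken from any paper below):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(predicted, gold, threshold=0.5):
    """Fraction of phrases whose predicted box matches gold at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, gold))
    return hits / len(gold)
```
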

Most implemented papers

Grounding of Textual Phrases in Images by Reconstruction

akirafukui/vqa-mcb 12 Nov 2015

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
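
At inference time, attention-based grounding of this kind reduces to scoring each candidate region against the phrase embedding and taking the highest-attention region. A toy sketch under that assumption (the embeddings and scoring function are placeholders, not the paper's learned model):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def ground_by_attention(phrase_vec, region_vecs):
    """Score each candidate region against the phrase embedding (dot product),
    normalize with softmax, and return the argmax region plus the weights."""
    scores = [sum(p * r for p, r in zip(phrase_vec, v)) for v in region_vecs]
    attn = softmax(scores)
    best = max(range(len(attn)), key=attn.__getitem__)
    return best, attn
```
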

Revisiting Image-Language Networks for Open-ended Phrase Detection

BryanPlummer/phrase_detection 17 Nov 2018

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image.

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

ashkamath/mdetr 26 Apr 2021

We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting.

Kosmos-2: Grounding Multimodal Large Language Models to the World

microsoft/unilm 26 Jun 2023

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) that enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

Conditional Image-Text Embedding Networks

BryanPlummer/cite ECCV 2018

This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model.

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

xiangchenchao/ddpn 9 May 2018

Visual grounding aims to localize an object in an image referred to by a textual query phrase.

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

hassanhub/MultiGrounding CVPR 2019

Following dedicated non-linear mappings for visual features at each level and for word and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity.
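
The matching step this describes, once text and visual features live in a shared space, is just a cosine-similarity argmax over regions. A minimal sketch (vectors here are made-up stand-ins for the paper's projected embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the common semantic space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_region(text_vec, region_vecs):
    """Return the index of the region embedding most similar to the text embedding."""
    return max(range(len(region_vecs)), key=lambda i: cosine(text_vec, region_vecs[i]))
```
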

Modularized Textual Grounding for Counterfactual Resilience

jacobswan1/MTG-pytorch CVPR 2019

Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries.

Zero-Shot Grounding of Objects from Natural Language Queries

TheShadow29/zsgnet-pytorch ICCV 2019

A phrase grounding system localizes a particular object in an image referred to by a natural language query.

Phrase Grounding by Soft-Label Chain Conditional Random Field

liujch1998/SoftLabelCCRF IJCNLP 2019

In this paper, we formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among regions for adjacent mentions.
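
Decoding such a linear-chain model over candidate regions amounts to the Viterbi algorithm. A pure-Python sketch of that decoding step (the unary scores and transition matrix are illustrative placeholders, not the paper's learned CRF potentials):

```python
def viterbi(unary, transition):
    """Viterbi decoding for a linear-chain model.

    unary: T x K list; unary[t][k] = score of candidate region k for mention t.
    transition: K x K list; transition[i][j] = score of region i -> region j
    for adjacent mentions. Returns the highest-scoring region index sequence.
    """
    T, K = len(unary), len(unary[0])
    score = list(unary[0])          # best score ending in each region so far
    backptrs = []                   # backpointers for path recovery
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: score[i] + transition[i][j])
            new_score.append(score[best_i] + transition[best_i][j] + unary[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    path = [max(range(K), key=score.__getitem__)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With zero transition scores this degenerates to per-mention argmax; a strong diagonal transition makes adjacent mentions prefer the same region, which is the dependency the chain CRF is meant to capture.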