Image-to-Text Retrieval
28 papers with code • 8 benchmarks • 8 datasets
Image-text retrieval refers to the process of finding relevant images based on textual descriptions or retrieving textual descriptions that are relevant to a given image. It's an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
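In practice, this gap is usually bridged by encoding images and texts into a shared embedding space and ranking one modality by similarity to a query from the other. A minimal sketch, assuming pre-computed embeddings (the encoders, vectors, and function names here are illustrative, not from any specific paper):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=2):
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q
    # Indices of the top_k most similar gallery items, best first
    return np.argsort(-scores)[:top_k]

# Toy pre-computed embeddings (in practice produced by vision/text encoders)
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
text_emb = np.array([0.9, 0.1])  # hypothetical embedding of a caption
print(retrieve(text_emb, image_embs))  # -> [0 2]
```

The same function works in both directions: swap the roles of the query and gallery to retrieve captions for a given image.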
Libraries
Use these libraries to find Image-to-Text Retrieval models and implementations.
Most implemented papers
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
Vision-Language Dataset Distillation
In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages.
Learning Relation Alignment for Calibrated Cross-modal Retrieval
To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, overlooking the matching between linguistic relations among words and visual relations among regions.
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval
In this paper, we introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method by integrating an extra graph convolution network to capture the general information of a graph.
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.
Design of the topology for contrastive visual-textual alignment
Cosine similarity is the common choice for measuring the distance between the feature representations in contrastive visual-textual alignment learning.
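Cosine similarity compares only the direction of two embedding vectors, making it invariant to their magnitudes. A quick illustration (the vectors below are made up for demonstration):

```python
import numpy as np

def cosine_similarity(u, v):
    # Angle-based similarity in [-1, 1]; independent of vector length
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

img = np.array([0.2, 0.5, 0.8])  # toy image embedding
txt = np.array([0.1, 0.6, 0.7])  # toy text embedding
print(cosine_similarity(img, txt))      # ~0.98
print(cosine_similarity(3 * img, txt))  # identical: scale-invariant
```

This scale invariance is one reason cosine similarity is the default choice in contrastive alignment; the paper above examines whether this topology is actually optimal.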
FETA: Towards Specializing Foundation Models for Expert Task Applications
However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieving technical illustrations from car manuals using language queries), whose data is either unseen or belongs to a long-tail part of the distribution of the huge datasets used for FM pre-training.
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the resulting inter-modal correlations rely on only a single view of each modality.
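The single-view contrastive objective referred to here is the symmetric InfoNCE-style loss popularized by CLIP: matched image-text pairs sit on the diagonal of a similarity matrix and are pulled together while mismatched pairs are pushed apart. A minimal NumPy sketch of that baseline objective (not ERNIE-ViL 2.0's multi-view extension; the temperature value is illustrative):

```python
import numpy as np

def clip_style_loss(img_embs, txt_embs, temperature=0.07):
    # Normalize embeddings so dot products are cosine similarities
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N): matched pairs on the diagonal
    labels = np.arange(len(img))

    def xent(l):
        # Softmax cross-entropy per row, with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss is near zero; shuffling the captions against the images makes it large, which is exactly the signal the contrastive objective trains on.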