Image-to-Text Retrieval

28 papers with code • 8 benchmarks • 8 datasets

Image-text retrieval refers to the process of finding relevant images based on textual descriptions or retrieving textual descriptions that are relevant to a given image. It's an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning. The aim is to bridge the semantic gap between the visual information present in images and the textual descriptions that humans use to interpret them.
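To make the idea concrete, below is a minimal sketch of dual-encoder retrieval: embed the image and the candidate captions in a shared space, score them by similarity, and rank. It uses CLIP through the Hugging Face transformers API purely as an illustration (CLIP is not one of the papers listed here, but the same embed-and-rank recipe underlies retrieval with the models below); the image path and candidate captions are hypothetical.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Illustrative dual-encoder retrieval: rank candidate captions for one image.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # hypothetical local file
    captions = [
        "a dog running on the beach",
        "a plate of pasta on a table",
        "a city skyline at night",
    ]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, num_captions): image-to-text similarity scores.
    scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    for caption, score in sorted(zip(captions, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {caption}")

Text-to-image retrieval is the symmetric case: embed a gallery of images once, then rank them against the embedding of a query caption.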

Libraries

Use these libraries to find Image-to-Text Retrieval models and implementations
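As a sketch of how one of the repositories listed below, salesforce/lavis, can be used to score an image against a caption: the model name, preprocessing calls, and the image_embeds_proj / text_embeds_proj attributes follow the library's documented feature-extraction interface, but they are assumptions here and should be checked against the installed version; the image file and caption are hypothetical.

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess  # assumes the salesforce/lavis package is installed

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load a BLIP-2 feature extractor (model name/type assumed from the LAVIS docs).
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
    )

    raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
    caption = "a dog running on the beach"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"](caption)

    # Unimodal embeddings projected into the shared image-text space.
    img_feats = model.extract_features({"image": image}, mode="image")
    txt_feats = model.extract_features({"text_input": [text]}, mode="text")

    # Image-text similarity: max over the image query-token embeddings vs. the text [CLS] embedding.
    sim = (img_feats.image_embeds_proj @ txt_feats.text_embeds_proj[:, 0, :].t()).max()
    print(f"image-text similarity: {sim.item():.3f}")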

Most implemented papers

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

salesforce/lavis NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

microsoft/Oscar ECCV 2020

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

Deep Visual-Semantic Alignments for Generating Image Descriptions

VinitSR7/Image-Caption-Generation CVPR 2015

Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

e-bug/volta 27 Jan 2022

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Exploring Models and Data for Remote Sensing Image Caption Generation

201528014227051/RSICD_optimal 21 Dec 2017

Finally, a comprehensive review is presented on the proposed dataset to fully advance the task of remote sensing image captioning.

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

BAAI-WuDao/BriVl 11 Mar 2021

We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval

VL-Group/2022-NeurIPS-DAA NeurIPS 2022

To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.