Cross-modal retrieval is the task of retrieving items in one modality using a query from another, such as image-text, video-text, and audio-text retrieval. The main challenge of cross-modal retrieval is the modality gap, and the key solution is to map the different modalities into a shared subspace, so that the newly generated representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
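The shared-subspace idea above can be sketched in a few lines: once images and texts are embedded into the same space, retrieval reduces to ranking by cosine similarity. This is a minimal numpy illustration with hand-made toy embeddings (the `cosine_sim` helper and the example vectors are illustrative, not from any specific paper).

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy shared-space embeddings: 3 images and their 3 paired captions (2-D for clarity).
image_emb = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])
text_emb = np.array([[0.9, 0.1],   # caption 0, close to image 0
                     [0.1, 0.9],   # caption 1, close to image 1
                     [1.0, 0.8]])  # caption 2, close to image 2

sims = cosine_sim(text_emb, image_emb)  # (3 captions) x (3 images)
ranked = np.argsort(-sims, axis=1)      # best-matching image first, per caption
print(ranked[:, 0])                     # -> [0 1 2]: each caption retrieves its image
```

In practice the embeddings come from learned encoders (e.g. a CNN for images and an RNN or transformer for text) trained so that matched pairs land close together; the ranking step stays exactly this simple.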
Prior work either simply aggregates the similarities of all possible region-word pairs without attending differentially to more and less important words or regions, or uses a multi-step attentional process that captures only a limited number of semantic alignments and is less interpretable.
Ranked #3 on Cross-Modal Retrieval on COCO 2014
In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space.
Ranked #5 on Text based Person Retrieval on CUHK-PEDES
It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (R@1 using the 1K test set).
Ranked #2 on Cross-Modal Retrieval on COCO 2014
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
Ranked #4 on Cross-Modal Retrieval on COCO 2014
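The entry above combines a global context vector with locally-guided features via multi-head self-attention and residual learning. As a rough, hedged sketch of that general idea (a minimal numpy toy, not the PIE-Net implementation; the function name `polysemous_embeddings` and the single attention matrix `W_att` are simplifications of my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def polysemous_embeddings(local_feats, global_feat, W_att):
    # local_feats: (L, D) region/word features; global_feat: (D,) context vector.
    # W_att: (K, D), one attention query per head -> K diverse embeddings.
    scores = softmax(W_att @ local_feats.T, axis=1)  # (K, L) attention weights
    attended = scores @ local_feats                  # (K, D) locally-guided features
    return global_feat[None, :] + attended           # residual combination, (K, D)

rng = np.random.default_rng(0)
L, D, K = 5, 8, 3                       # 5 local features, dim 8, 3 heads
local_feats = rng.normal(size=(L, D))
global_feat = local_feats.mean(axis=0)  # simple global context
W_att = rng.normal(size=(K, D))

emb = polysemous_embeddings(local_feats, global_feat, W_att)
print(emb.shape)  # (3, 8): K diverse embeddings for one instance
```

Because each head attends to different local features, the K output rows give multiple distinct representations of the same instance, which is what lets the model handle polysemous images or sentences.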
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
In this paper, we address text-image matching in cross-modal retrieval for the fashion industry.
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.