Cross-Modal Retrieval

192 papers with code • 13 benchmarks • 21 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to learn new representations for the different modalities in a shared subspace, so that the resulting features can be compared directly with standard distance metrics such as cosine distance and Euclidean distance.
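
As a minimal illustration of this shared-subspace idea (a sketch, not taken from any of the papers below), the snippet projects pre-extracted features from two modalities into a common space and ranks candidates by cosine similarity; the feature dimensions, projection layers, and dummy inputs are assumptions made purely for illustration.

```python
# Minimal sketch: project two modalities into a shared subspace and rank by
# cosine similarity. Dimensions and inputs are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    def __init__(self, in_dim, shared_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)

    def forward(self, x):
        # L2-normalise so cosine similarity reduces to a dot product
        return F.normalize(self.proj(x), dim=-1)

# Hypothetical feature dimensions for two modalities (e.g. image and text)
image_proj = SharedSpaceProjector(in_dim=2048)
text_proj = SharedSpaceProjector(in_dim=768)

image_feats = torch.randn(100, 2048)   # pre-extracted image features (dummy)
text_query = torch.randn(1, 768)       # one text query feature (dummy)

img_emb = image_proj(image_feats)      # (100, 512) in the shared subspace
txt_emb = text_proj(text_query)        # (1, 512)

# Cosine similarity between the query and every image; higher = closer
scores = txt_emb @ img_emb.t()         # (1, 100)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking[0, :5])                  # indices of the top-5 retrieved images
```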

Most implemented papers

Dual-Path Convolutional Image-Text Embeddings with Instance Loss

layumi/Image-Text-Embedding 15 Nov 2017

In this paper, we propose a new system to discriminatively embed the image and text to a shared visual-textual space.
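
A hedged sketch of the instance-loss idea, under the assumption that each training image-text pair is treated as its own class and both branches share a softmax classifier; the dimensions and names below are illustrative, not the authors' released code.

```python
# Sketch of an instance-style loss: each image-text pair is its own class,
# and both branches are classified with a shared softmax classifier.
import torch
import torch.nn as nn

num_instances = 1000          # number of training image-text pairs (assumed)
shared_dim = 512

classifier = nn.Linear(shared_dim, num_instances)  # shared over both branches
ce = nn.CrossEntropyLoss()

img_emb = torch.randn(32, shared_dim)   # image embeddings for a batch (dummy)
txt_emb = torch.randn(32, shared_dim)   # matching text embeddings (dummy)
instance_ids = torch.randint(0, num_instances, (32,))  # pair/instance labels

# Each branch must classify its embedding into the correct instance id,
# which pulls matched image and text features toward the same region.
loss = ce(classifier(img_emb), instance_ids) + ce(classifier(txt_emb), instance_ids)
loss.backward()
```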

Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images

LARC-CMU-SMU/ACME CVPR 2019

Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval

dddzeng/CMR_audiovisual 10 Aug 2019

In particular, two significant contributions are made: i) a better representation is obtained by constructing a deep triplet neural network with triplet loss, so that optimal projections can be generated to maximize correlation in the shared subspace.
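
A minimal sketch of a cross-modal triplet loss of this kind, assuming audio and visual features have already been projected into a shared subspace; the margin, shapes, and Euclidean distance choice are assumptions, not the paper's exact configuration.

```python
# Sketch of a cross-modal triplet loss: the matching visual sample should be
# closer to the audio anchor than a mismatched one by at least `margin`.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(anchor_audio, pos_visual, neg_visual, margin=0.2):
    d_pos = F.pairwise_distance(anchor_audio, pos_visual)
    d_neg = F.pairwise_distance(anchor_audio, neg_visual)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = torch.randn(16, 512)      # audio embeddings (dummy)
positive = torch.randn(16, 512)    # matching visual embeddings (dummy)
negative = torch.randn(16, 512)    # mismatched visual embeddings (dummy)
print(cross_modal_triplet_loss(anchor, positive, negative))
```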

Visual Semantic Reasoning for Image-Text Matching

KunpengLi1994/VSRN ICCV 2019

It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using 1K test set).

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

lionel-hing/bic-net 29 Oct 2021

The task of text-video retrieval, which aims to understand the correspondence between language and vision, has gained increasing attention in recent years.

An Empirical Study of Training End-to-End Vision-and-Language Transformers

zdou0830/meter CVPR 2022

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks.

Fusion and Orthogonal Projection for Improved Face-Voice Association

msaadsaeed/FOP 20 Dec 2021

Prior works adopt pairwise or triplet loss formulations to learn an embedding space amenable to associated matching and verification tasks.

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

microsoft/unilm 22 Aug 2022

A big convergence of language, vision, and multimodal pretraining is emerging.