Cross-Modal Retrieval

192 papers with code • 13 benchmarks • 21 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to learn new representations for the different modalities in a shared subspace, so that the resulting features can be compared directly with standard distance metrics such as cosine distance and Euclidean distance.
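
As a minimal illustration of this shared-subspace idea (a sketch, not taken from any of the papers below), the snippet projects pre-extracted features from two modalities into a common space and ranks candidates by cosine similarity; the feature dimensions, projection layers, and dummy inputs are assumptions made purely for illustration.

```python
# Minimal sketch: project two modalities into a shared subspace and rank by
# cosine similarity. Dimensions and inputs are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    def __init__(self, in_dim, shared_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)

    def forward(self, x):
        # L2-normalise so cosine similarity reduces to a dot product
        return F.normalize(self.proj(x), dim=-1)

# Hypothetical feature dimensions for two modalities (e.g. image and text)
image_proj = SharedSpaceProjector(in_dim=2048)
text_proj = SharedSpaceProjector(in_dim=768)

image_feats = torch.randn(100, 2048)   # pre-extracted image features (dummy)
text_query = torch.randn(1, 768)       # one text query feature (dummy)

img_emb = image_proj(image_feats)      # (100, 512) in the shared subspace
txt_emb = text_proj(text_query)        # (1, 512)

# Cosine similarity between the query and every image; higher = closer
scores = txt_emb @ img_emb.t()         # (1, 100)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking[0, :5])                  # indices of the top-5 retrieved images
```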

Most implemented papers

Dual-Path Convolutional Image-Text Embeddings with Instance Loss

layumi/Image-Text-Embedding 15 Nov 2017

In this paper, we propose a new system to discriminatively embed the image and text to a shared visual-textual space.
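
A hedged sketch of the instance-loss idea, under the assumption that each training image-text pair is treated as its own class and both branches share a softmax classifier; the dimensions and names below are illustrative, not the authors' released code.

```python
# Sketch of an instance-style loss: each image-text pair is its own class,
# and both branches are classified with a shared softmax classifier.
import torch
import torch.nn as nn

num_instances = 1000          # number of training image-text pairs (assumed)
shared_dim = 512

classifier = nn.Linear(shared_dim, num_instances)  # shared over both branches
ce = nn.CrossEntropyLoss()

img_emb = torch.randn(32, shared_dim)   # image embeddings for a batch (dummy)
txt_emb = torch.randn(32, shared_dim)   # matching text embeddings (dummy)
instance_ids = torch.randint(0, num_instances, (32,))  # pair/instance labels

# Each branch must classify its embedding into the correct instance id,
# which pulls matched image and text features toward the same region.
loss = ce(classifier(img_emb), instance_ids) + ce(classifier(txt_emb), instance_ids)
loss.backward()
```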

Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images

LARC-CMU-SMU/ACME CVPR 2019

Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval

dddzeng/CMR_audiovisual 10 Aug 2019

In particular, two significant contributions are made: i) a better representation is obtained by constructing a deep triplet neural network with triplet loss, so that optimal projections can be generated to maximize correlation in the shared subspace.
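
A minimal sketch of a cross-modal triplet loss of this kind, assuming audio and visual features have already been projected into a shared subspace; the margin, shapes, and Euclidean distance choice are assumptions, not the paper's exact configuration.

```python
# Sketch of a cross-modal triplet loss: the matching visual sample should be
# closer to the audio anchor than a mismatched one by at least `margin`.
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(anchor_audio, pos_visual, neg_visual, margin=0.2):
    d_pos = F.pairwise_distance(anchor_audio, pos_visual)
    d_neg = F.pairwise_distance(anchor_audio, neg_visual)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = torch.randn(16, 512)      # audio embeddings (dummy)
positive = torch.randn(16, 512)    # matching visual embeddings (dummy)
negative = torch.randn(16, 512)    # mismatched visual embeddings (dummy)
print(cross_modal_triplet_loss(anchor, positive, negative))
```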

Visual Semantic Reasoning for Image-Text Matching

KunpengLi1994/VSRN ICCV 2019

It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using 1K test set).

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

mindspore-ai/models 1 Jul 2021

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

lionel-hing/bic-net 29 Oct 2021

The task of text-video retrieval, which aims to understand the correspondence between language and vision, has gained increasing attention in recent years.

An Empirical Study of Training End-to-End Vision-and-Language Transformers

zdou0830/meter CVPR 2022

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks.

Fusion and Orthogonal Projection for Improved Face-Voice Association

msaadsaeed/FOP 20 Dec 2021

Prior works adopt pairwise or triplet loss formulations to learn an embedding space amenable to associated matching and verification tasks.

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

microsoft/unilm 22 Aug 2022

A big convergence of language, vision, and multimodal pretraining is emerging.