
Cross-Modal Retrieval

24 papers with code · Miscellaneous
Subtask of Multi-Modal

Cross-Modal Retrieval implements a retrieval task across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap; the key solution is to learn new representations for the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance and Euclidean distance.

Source: Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
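As a minimal sketch of the shared-subspace idea described above (not taken from any of the papers below), the snippet projects image and text features into a joint embedding space with hypothetical linear encoders and ranks candidates by cosine similarity; dimensions and encoder choices are illustrative only.

```python
# Minimal sketch of shared-subspace cross-modal retrieval (hypothetical encoders).
# Image and text features are projected into a joint space and ranked by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # e.g. CNN features -> joint space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # e.g. sentence features -> joint space

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that a dot product equals cosine similarity
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img_emb, txt_emb

model = SharedSpaceEncoder()
images = torch.randn(5, 2048)    # 5 image feature vectors
captions = torch.randn(5, 300)   # 5 caption feature vectors
img_emb, txt_emb = model(images, captions)

# Similarity matrix: entry (i, j) is the cosine similarity between image i and caption j.
sim = img_emb @ txt_emb.t()
# Text-to-image retrieval: for each caption, rank images by similarity.
ranking = sim.t().argsort(dim=1, descending=True)
print(ranking)
```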

Benchmarks

Greatest papers with code

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

18 Jul 2017 fartashf/vsepp

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL STRUCTURED PREDICTION
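The core idea of VSE++ is to emphasize the hardest negatives in each mini-batch inside a hinge-based triplet ranking loss. Below is a hedged sketch of such a max-of-hinges loss; the margin value and tensor shapes are illustrative and not taken from the paper's code.

```python
# Sketch of a max-of-hinges triplet loss with in-batch hard negatives, in the spirit of VSE++.
# img_emb and txt_emb are L2-normalized embeddings of matching image/caption pairs
# (row i of each matrix forms a positive pair); the margin is illustrative.
import torch

def hard_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    scores = img_emb @ txt_emb.t()              # cosine similarities for all pairs
    pos = scores.diag().view(-1, 1)             # similarity of each matching pair

    cost_txt = (margin + scores - pos).clamp(min=0)      # image -> caption hinge
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> image hinge

    # Mask out the positive pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # Keep only the hardest negative per query instead of summing over all negatives.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```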

Stacked Cross Attention for Image-Text Matching

ECCV 2018 kuanghuei/SCAN

Prior work either simply aggregates the similarities of all possible region-word pairs without attending differentially to more and less important words or regions, or uses a multi-step attentional process that captures only a limited number of semantic alignments and is less interpretable.

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL TEXT MATCHING
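A hedged sketch of the text-to-image direction of stacked cross attention, in the spirit of SCAN: each word attends over image regions, and the word-context similarities are pooled into an image-sentence score. The temperature, clamping, and mean pooling here are illustrative choices, not the paper's exact formulation.

```python
# Sketch of text-to-image stacked cross attention in the spirit of SCAN
# (single image/sentence pair; temperature and pooling are illustrative).
import torch
import torch.nn.functional as F

def stacked_cross_attention_score(regions, words, temperature=9.0):
    # regions: (R, D) region features; words: (W, D) word features, both L2-normalized.
    sim = words @ regions.t()                                 # (W, R) word-region similarities
    attn = F.softmax(temperature * sim.clamp(min=0), dim=1)   # attend each word over regions
    context = attn @ regions                                  # (W, D) attended image context per word
    # Relevance of each word to its attended context, pooled over the sentence.
    word_scores = F.cosine_similarity(words, context, dim=1)
    return word_scores.mean()

regions = F.normalize(torch.randn(36, 512), dim=1)   # e.g. 36 detected regions
words = F.normalize(torch.randn(12, 512), dim=1)     # e.g. 12 word features
print(stacked_cross_attention_score(regions, words))
```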

Visual Semantic Reasoning for Image-Text Matching

ICCV 2019 KunpengLi1994/VSRN

It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (Recall@1 using the 1K test set).

CROSS-MODAL RETRIEVAL IMAGE RETRIEVAL TEXT MATCHING

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

CVPR 2019 yalesong/pvse

In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.

CROSS-MODAL RETRIEVAL MULTIPLE INSTANCE LEARNING
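A hedged sketch of such a polysemous embedding head, in the spirit of PIE-Nets: K learnable query slots attend over local features with multi-head attention, and each attended output is combined with the global feature through a residual connection. The dimensions, K, and layer choices are illustrative, not the paper's code.

```python
# Sketch of a polysemous embedding head in the spirit of PIE-Nets (illustrative sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolysemousHead(nn.Module):
    def __init__(self, dim=512, num_embeds=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_embeds, dim))   # K query slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D); local_feats: (B, L, D), e.g. region or word features.
        q = self.queries.unsqueeze(0).expand(local_feats.size(0), -1, -1)  # (B, K, D)
        attended, _ = self.attn(q, local_feats, local_feats)               # (B, K, D)
        # Residual combination of locally-guided features with the global context.
        out = global_feat.unsqueeze(1) + self.fc(attended)
        return F.normalize(out, dim=-1)   # K diverse, L2-normalized embeddings per instance

head = PolysemousHead()
emb = head(torch.randn(2, 512), torch.randn(2, 36, 512))
print(emb.shape)   # torch.Size([2, 4, 512])
```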

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

CVPR 2020 cshizhe/hgr_v2t

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

CROSS-MODAL RETRIEVAL TEXT MATCHING
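A minimal sketch of the global-to-local matching idea, assuming video and text have already been embedded at several semantic levels (e.g. event, action, entity) and the per-level similarities are aggregated into one retrieval score; the level names and equal weighting are illustrative, not HGR's actual graph reasoning.

```python
# Sketch of global-to-local similarity aggregation in the spirit of HGR (illustrative levels/weights).
import torch
import torch.nn.functional as F

def hierarchical_similarity(video_embs, text_embs, weights=(1.0, 1.0, 1.0)):
    # video_embs / text_embs: dicts mapping level name -> (D,) embedding for one pair.
    score = 0.0
    for w, level in zip(weights, ("event", "action", "entity")):
        score += w * F.cosine_similarity(video_embs[level], text_embs[level], dim=0)
    return score / sum(weights)

levels = ("event", "action", "entity")
video = {l: F.normalize(torch.randn(512), dim=0) for l in levels}
text = {l: F.normalize(torch.randn(512), dim=0) for l in levels}
print(hierarchical_similarity(video, text))
```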

FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval

20 May 2020 alibaba/EasyTransfer

In this paper, we address text and image matching in cross-modal retrieval for the fashion industry.

CROSS-MODAL RETRIEVAL

Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images

CVPR 2019 hwang1996/ACME

Food computing is playing an increasingly important role in daily human life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.

CROSS-MODAL RETRIEVAL

Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint

22 Apr 2017 csehong/VM-NET

Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.

CROSS-MODAL RETRIEVAL