Cross-Modal Retrieval
192 papers with code • 13 benchmarks • 21 datasets
Cross-Modal Retrieval is the task of retrieving items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to map representations from the different modalities into a shared subspace, so that the resulting features can be compared with standard distance metrics, such as cosine distance and Euclidean distance.
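Once both modalities are embedded in a shared subspace, retrieval reduces to a nearest-neighbor search under a distance metric. The sketch below illustrates this with cosine similarity over randomly generated placeholder embeddings; in practice the vectors would come from trained image and text encoders, and all array names here are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def cosine_similarity(a, b):
    # L2-normalize each row, then take dot products:
    # cos(a_i, b_j) for every query/candidate pair.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings standing in for encoder outputs that have
# already been projected into a shared 64-dimensional subspace.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(5, 64))  # 5 candidate images
text_embeddings = rng.normal(size=(3, 64))   # 3 text queries

# Score every text query against every image: shape (3, 5).
scores = cosine_similarity(text_embeddings, image_embeddings)

# Text-to-image retrieval: pick the highest-scoring image per query.
best_image_per_query = scores.argmax(axis=1)
```

The same similarity matrix supports the reverse direction (image-to-text) by taking the argmax along the other axis, which is why a single shared subspace serves both retrieval directions.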
References:
[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
Libraries
Use these libraries to find Cross-Modal Retrieval models and implementations
Datasets
Subtasks
Most implemented papers
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language
Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire the single cognitive ability from the single molecular modality.
Deep Visual-Semantic Alignments for Generating Image Descriptions
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
In this paper, we address the text and image matching in cross-modal retrieval of the fashion industry.
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to each other.
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
A Channel Mix Method for Fine-Grained Cross-Modal Retrieval
In this paper, we propose a simple but effective method for the challenging fine-grained cross-modal retrieval task, which aims to enable flexible retrieval among subordinate categories across different modalities.
Order-Embeddings of Images and Language
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.
Deep Cross-Modal Hashing
Due to its low storage cost and fast query speed, cross-modal hashing (CMH) has been widely used for similarity search in multimedia retrieval applications.
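The low storage cost and fast query speed of cross-modal hashing come from representing items of every modality as short binary codes and ranking by Hamming distance. A minimal sketch of that retrieval step, with randomly generated codes standing in for the outputs of learned hash functions (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 32  # code length in bits

# Placeholder binary codes: in CMH these would be produced by learned
# modality-specific hash functions mapping into a common Hamming space.
db_codes = rng.integers(0, 2, size=(100, K), dtype=np.uint8)  # e.g. image codes
query_code = rng.integers(0, 2, size=(K,), dtype=np.uint8)    # e.g. text code

# Hamming distance = number of differing bits per database item.
hamming = np.count_nonzero(db_codes != query_code, axis=1)

# Retrieve the 5 nearest items (smallest Hamming distance).
top5 = np.argsort(hamming)[:5]
```

Because each item is only K bits, comparisons are bitwise operations rather than floating-point distance computations, which is what makes hashing attractive at multimedia scale.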
Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions
We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation.
Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.