Cross-Modal Retrieval
190 papers with code • 12 benchmarks • 20 datasets
Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap: raw features from different modalities are not directly comparable. The key solution is to map inputs from each modality into a shared subspace, so that the resulting representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
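As a minimal sketch of the retrieval step described above: assuming image and text encoders have already projected their inputs into a shared embedding space (the encoders and the 128-dimensional embeddings below are placeholders, here filled with random vectors), ranking gallery items against a query reduces to a cosine-similarity computation.

```python
import numpy as np

# Hypothetical pre-computed embeddings: in a real system these would come
# from modality-specific encoders (e.g. an image CNN and a text transformer)
# projected into a shared d-dimensional subspace.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(5, 128))  # 5 gallery images
text_query = rng.normal(size=(128,))          # 1 text query

def cosine_similarity(gallery: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each gallery vector and the query vector."""
    gallery_norm = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    return gallery_norm @ query_norm

scores = cosine_similarity(image_embeddings, text_query)
ranking = np.argsort(-scores)  # indices of gallery images, best match first
print(ranking)
```

Euclidean distance could be substituted for cosine similarity; on L2-normalized embeddings the two produce the same ranking, which is why cosine similarity is the common default.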
References:
[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
Latest papers with no code
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources.
VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition
Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed for image-based and LiDAR-based modalities.
A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels
Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to the well-annotated data.
Improving Medical Multi-modal Contrastive Learning with Expert Annotations
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps.
Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging
Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications.
Large Language Models are In-Context Molecule Learners
Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning.
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
Information retrieval is an ever-evolving and crucial research domain.
Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective.
Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters.
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by cross-modal alignment of a RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP.