Cross-Modal Retrieval

190 papers with code • 12 benchmarks • 20 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap: representations produced by different modalities are not directly comparable. The key solution is to map the different modalities into a shared subspace, so that the resulting representations can be compared with common distance metrics such as cosine distance and Euclidean distance.
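As a minimal sketch of this shared-subspace idea (the feature dimensions, projection heads, and InfoNCE-style contrastive loss below are illustrative assumptions, not any particular published method), image and text features can be projected into a common space and ranked by cosine similarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions (assumptions, not taken from a specific paper)
IMG_DIM, TXT_DIM, SHARED_DIM = 2048, 768, 256

class SharedSpaceProjector(nn.Module):
    """Projects image and text features into a common subspace."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_DIM, SHARED_DIM)
        self.txt_proj = nn.Linear(TXT_DIM, SHARED_DIM)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product equals cosine similarity
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE-style loss that pulls matched pairs together."""
    logits = z_img @ z_txt.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_img.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: rank gallery texts for each query image by cosine similarity
model = SharedSpaceProjector()
img_feats = torch.randn(4, IMG_DIM)   # e.g. CNN/ViT image features
txt_feats = torch.randn(4, TXT_DIM)   # e.g. BERT sentence features
z_img, z_txt = model(img_feats, txt_feats)
loss = contrastive_loss(z_img, z_txt)                          # training objective
ranks = (z_img @ z_txt.t()).argsort(dim=1, descending=True)    # retrieval ranking
```

Because the projected vectors are L2-normalized, ranking by Euclidean distance yields the same ordering as ranking by cosine similarity.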

Latest papers with no code

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

no code yet • 16 Apr 2024

Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources.
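As a rough sketch of distillation from a frozen teacher's output features (the dimensions, the tiny student architecture, and the MSE objective below are illustrative assumptions; the multiscale and self-adaptive parts of the paper are not reproduced):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions: a large frozen teacher, a lightweight student
TEACHER_DIM, STUDENT_DIM = 1024, 256

class StudentEncoder(nn.Module):
    """Small student with a head that maps into the teacher's feature space."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 224 * 224, STUDENT_DIM),
            nn.ReLU(),
        )
        self.to_teacher = nn.Linear(STUDENT_DIM, TEACHER_DIM)

    def forward(self, images):
        return self.to_teacher(self.backbone(images))

def distillation_loss(student_feats, teacher_feats):
    # Match the (normalized) output features of the frozen teacher
    return F.mse_loss(F.normalize(student_feats, dim=-1),
                      F.normalize(teacher_feats, dim=-1))

student = StudentEncoder()
images = torch.randn(2, 3, 224, 224)
teacher_feats = torch.randn(2, TEACHER_DIM)  # stand-in for the frozen teacher's outputs
loss = distillation_loss(student(images), teacher_feats)
```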

VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition

no code yet • 21 Mar 2024

Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed for image-based and LiDAR-based modalities.

A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels

no code yet • 20 Mar 2024

Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to the well-annotated data.

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

no code yet • 15 Mar 2024

We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps.

Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging

no code yet • 12 Mar 2024

Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications.

Large Language Models are In-Context Molecule Learners

no code yet • 7 Mar 2024

Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning.

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

no code yet • 1 Mar 2024

Information retrieval is an ever-evolving and crucial research domain.

Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts

no code yet • 26 Feb 2024

However, the correlation between fonts and their impression is weak and unstable because impressions are subjective.

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

no code yet • 16 Feb 2024

Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters.
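A toy illustration of this memorize-and-recall interface: each gallery image carries an identifier string, and retrieval amounts to generating (here, simply scoring) the most likely identifier for a query. The gallery, identifiers, and word-overlap scorer below are stand-ins; a real system would use the MLLM's log-likelihood with decoding constrained to valid identifiers.

```python
from typing import Callable, Dict, List

# Hypothetical gallery: identifier string -> caption used only by the toy scorer
gallery: Dict[str, str] = {
    "img-0001": "a dog playing in the snow",
    "img-0002": "a red bicycle leaning against a wall",
}

def toy_identifier_score(query: str, identifier: str) -> float:
    # Stand-in for an MLLM's likelihood of generating this identifier:
    # here, just word overlap between the query and the image's caption.
    caption = gallery[identifier]
    return float(len(set(query.lower().split()) & set(caption.lower().split())))

def generative_retrieve(query: str, score: Callable[[str, str], float]) -> List[str]:
    # Rank identifiers by how readily the model would "recall" them for this query.
    return sorted(gallery, key=lambda ident: score(query, ident), reverse=True)

print(generative_retrieve("dog in snow", toy_identifier_score))
```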

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

no code yet • 15 Feb 2024

Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by cross-modal alignment of an RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP.