Cross-Modal Retrieval
190 papers with code • 12 benchmarks • 20 datasets
Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap: raw features from different modalities are not directly comparable. The key solution is to map inputs from each modality into a shared subspace, so that the resulting representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
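As a minimal sketch of the retrieval step described above: assuming image and text encoders have already projected their inputs into a shared embedding space (the encoders and the 128-dimensional embeddings below are placeholders, here filled with random vectors), ranking gallery items against a query reduces to a cosine-similarity computation.

```python
import numpy as np

# Hypothetical pre-computed embeddings: in a real system these would come
# from modality-specific encoders (e.g. an image CNN and a text transformer)
# projected into a shared d-dimensional subspace.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(5, 128))  # 5 gallery images
text_query = rng.normal(size=(128,))          # 1 text query

def cosine_similarity(gallery: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each gallery vector and the query vector."""
    gallery_norm = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    return gallery_norm @ query_norm

scores = cosine_similarity(image_embeddings, text_query)
ranking = np.argsort(-scores)  # indices of gallery images, best match first
print(ranking)
```

Euclidean distance could be substituted for cosine similarity; on L2-normalized embeddings the two produce the same ranking, which is why cosine similarity is the common default.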
References:
[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval
Latest papers with no code
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning
Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources.
VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition
Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed for image-based and LiDAR-based modalities.
A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels
Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to the well-annotated data.
Improving Medical Multi-modal Contrastive Learning with Expert Annotations
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps.
Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging
Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications.
Large Language Models are In-Context Molecule Learners
Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning.
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
Information retrieval is an ever-evolving and crucial research domain.
Impression-CLIP: Contrastive Shape-Impression Embedding for Fonts
However, the correlation between fonts and their impressions is weak and unstable because impressions are subjective.
Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters.
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by cross-modal alignment of a RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP.