Cross-Modal Retrieval

192 papers with code • 13 benchmarks • 21 datasets

Cross-Modal Retrieval implements retrieval tasks across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to generate new representations from the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance and Euclidean distance.
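Once both modalities are embedded in the shared subspace, retrieval reduces to a nearest-neighbor search under such a metric. A minimal numpy sketch (the function name and toy embeddings are illustrative, not from any specific system):

```python
import numpy as np

def cosine_retrieve(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to the query.

    Assumes both modalities (e.g. a text query and an image
    gallery) have already been projected into a shared subspace.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                 # cosine similarity of query vs. each item
    return np.argsort(-scores)     # gallery indices, best match first

# Toy example: three "image" embeddings in a shared 4-d space and a
# "text" query embedding closest to gallery item 1.
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.1, 0.9, 0.0, 0.0])
ranking = cosine_retrieve(query, gallery)  # → array([1, 0, 2])
```

Euclidean distance gives the same ranking here once the embeddings are L2-normalized, which is why cosine similarity is the common default.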

References:

[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval

Latest papers with no code

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

no code yet • 15 Feb 2024

Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by cross-modal alignment of an RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP.
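The cross-modal alignment step in CLIP-style methods is typically trained with a symmetric contrastive (InfoNCE) objective over a batch of matched pairs. A minimal numpy sketch of that objective under that assumption (not the paper's implementation):

```python
import numpy as np

def clip_style_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss for CLIP-style cross-modal alignment.

    Matched image/text pairs lie on the diagonal of the similarity
    matrix; each row/column is treated as a classification problem.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(img))         # i-th image matches i-th text

    def cross_entropy(l, y):
        # numerically stable row-wise log-softmax
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))

# Aligned pairs should score a lower loss than mismatched ones.
embs = np.eye(3)
aligned = clip_style_loss(embs, embs)
mismatched = clip_style_loss(embs, embs[::-1])
```

Minimizing this loss pulls matched image and text embeddings together in the shared space while pushing apart the other pairs in the batch.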

Large Language Models for Captioning and Retrieving Remote Sensing Images

no code yet • 9 Feb 2024

In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval.

Cross-Modal Coordination Across a Diverse Set of Input Modalities

no code yet • 29 Jan 2024

Although this cross-modal coordination has also been applied to other pairwise combinations, extending it to an arbitrary number of diverse modalities is a problem that has not been fully explored in the literature.

Enhancing medical vision-language contrastive learning via inter-matching relation modelling

no code yet • 19 Jan 2024

These learned image representations can be transferred to and benefit various downstream medical vision tasks such as disease classification and segmentation.

Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering

no code yet • 15 Jan 2024

Central to our focus is the use of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements.

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

no code yet • 27 Dec 2023

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.

Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval

no code yet • 26 Dec 2023

However, due to task competition and information interference caused by significant differences between the inputs of the two proxy tasks, the effectiveness of representation learning for intra-modal and cross-modal features is limited.

CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

no code yet • 14 Dec 2023

Cross-lingual cross-modal retrieval has garnered increasing attention recently, which aims to achieve the alignment between vision and target language (V-T) without using any annotated V-T data pairs.

WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

no code yet • 14 Dec 2023

The model is evaluated on two tasks: tag-based music retrieval and music auto-tagging.

Uni3DL: Unified Model for 3D and Language Understanding

no code yet • 5 Dec 2023

In this work, we present Uni3DL, a unified model for 3D and Language understanding.