Cross-Modal Retrieval

192 papers with code • 13 benchmarks • 21 datasets

Cross-Modal Retrieval implements retrieval tasks across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge of Cross-Modal Retrieval is the modality gap, and the key solution is to generate new representations from the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance and Euclidean distance.
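Once both modalities are embedded in the shared subspace, retrieval reduces to a nearest-neighbor search under such a metric. A minimal numpy sketch (the function name and toy embeddings are illustrative, not from any specific system):

```python
import numpy as np

def cosine_retrieve(query_emb, gallery_embs):
    """Rank gallery items by cosine similarity to the query.

    Assumes both modalities (e.g. a text query and an image
    gallery) have already been projected into a shared subspace.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                 # cosine similarity of query vs. each item
    return np.argsort(-scores)     # gallery indices, best match first

# Toy example: three "image" embeddings in a shared 4-d space and a
# "text" query embedding closest to gallery item 1.
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.1, 0.9, 0.0, 0.0])
ranking = cosine_retrieve(query, gallery)  # → array([1, 0, 2])
```

Euclidean distance gives the same ranking here once the embeddings are L2-normalized, which is why cosine similarity is the common default.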

References:

[1] Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

[2] Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-modal Retrieval

Latest papers with no code

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

no code yet • 15 Feb 2024

Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the distribution shift, followed by cross-modal alignment of an RS modality encoder, in an effort to extend the zero-shot capabilities of CLIP.
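The cross-modal alignment step in CLIP-style methods is typically trained with a symmetric contrastive (InfoNCE) objective over a batch of matched pairs. A minimal numpy sketch of that objective under that assumption (not the paper's implementation):

```python
import numpy as np

def clip_style_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss for CLIP-style cross-modal alignment.

    Matched image/text pairs lie on the diagonal of the similarity
    matrix; each row/column is treated as a classification problem.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(img))         # i-th image matches i-th text

    def cross_entropy(l, y):
        # numerically stable row-wise log-softmax
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))

# Aligned pairs should score a lower loss than mismatched ones.
embs = np.eye(3)
aligned = clip_style_loss(embs, embs)
mismatched = clip_style_loss(embs, embs[::-1])
```

Minimizing this loss pulls matched image and text embeddings together in the shared space while pushing apart the other pairs in the batch.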

Large Language Models for Captioning and Retrieving Remote Sensing Images

no code yet • 9 Feb 2024

In this work, we propose RS-CapRet, a Vision and Language method for remote sensing tasks, in particular image captioning and text-image retrieval.

Cross-Modal Coordination Across a Diverse Set of Input Modalities

no code yet • 29 Jan 2024

Although this cross-modal coordination has also been applied to other pairwise combinations, extending it to an arbitrary number of diverse modalities is a problem that has not been fully explored in the literature.

Enhancing medical vision-language contrastive learning via inter-matching relation modelling

no code yet • 19 Jan 2024

These learned image representations can be transferred to and benefit various downstream medical vision tasks such as disease classification and segmentation.

Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering

no code yet • 15 Jan 2024

Central to our focus is the use of language models and multimodal paradigms for medical question answering, aiming to guide the research community in selecting appropriate mechanisms for their specific medical research requirements.

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

no code yet • 27 Dec 2023

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.

Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval

no code yet • 26 Dec 2023

However, due to task competition and information interference caused by significant differences between the inputs of the two proxy tasks, the effectiveness of representation learning for intra-modal and cross-modal features is limited.

CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

no code yet • 14 Dec 2023

Cross-lingual cross-modal retrieval has garnered increasing attention recently, which aims to achieve the alignment between vision and target language (V-T) without using any annotated V-T data pairs.

WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

no code yet • 14 Dec 2023

The model is evaluated on two tasks: tag-based music retrieval and music auto-tagging.

Uni3DL: Unified Model for 3D and Language Understanding

no code yet • 5 Dec 2023

In this work, we present Uni3DL, a unified model for 3D and Language understanding.