Cross-Modal Retrieval
190 papers with code • 12 benchmarks • 20 datasets
Cross-Modal Retrieval implements a retrieval task across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap; the key solution is to learn representations for the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance and Euclidean distance.
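As a concrete illustration of the shared-subspace idea, the sketch below projects precomputed image and text features into one space and ranks candidates by cosine similarity. The encoders, dimensions, and names are illustrative assumptions, not taken from any particular paper.

```python
# Minimal shared-subspace sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedder(nn.Module):
    """Projects image and text features into one shared space."""
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products equal cosine similarity.
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

model = TwoBranchEmbedder()
img_feats = torch.randn(4, 2048)  # e.g., CNN features for 4 images
txt_feats = torch.randn(6, 768)   # e.g., text-encoder features for 6 captions
z_img, z_txt = model(img_feats, txt_feats)

# Entry (i, j) scores image i against caption j; retrieval ranks
# captions for each image (or images for each caption) by this score.
sim = z_img @ z_txt.t()
print(sim.shape)  # torch.Size([4, 6])
```

In practice the two projections are trained jointly (e.g., with a contrastive or triplet loss) so that matching pairs land close together in the shared space.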
Latest papers
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with domain-specific knowledge in pathology.
Bridging Vision and Language Spaces with Assignment Prediction
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.
Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval
To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs.
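To make the OT idea concrete, here is a hedged sketch of rematching with generic entropic Optimal Transport (Sinkhorn iterations) over a cross-modal cost matrix. This is not the L2RM implementation; the function name, uniform marginals, and hyperparameters are all assumptions.

```python
# Generic Sinkhorn rematching sketch (not the L2RM code).
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-OT transport plan for a cost matrix with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)          # Gibbs kernel
    a = torch.full((n,), 1.0 / n)       # uniform row marginal
    b = torch.full((m,), 1.0 / m)       # uniform column marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]  # transport plan

# Cost = 1 - cosine similarity between (possibly mismatched) embeddings.
z_img = F.normalize(torch.randn(8, 256), dim=-1)
z_txt = F.normalize(torch.randn(8, 256), dim=-1)
plan = sinkhorn_plan(1.0 - z_img @ z_txt.t())

# High-mass entries of the plan suggest better image-text pairings than
# the given, possibly noisy, correspondence.
rematch = plan.argmax(dim=1)
```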
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Ground-truth captions can also serve as additional trajectories in the RL strategy, resulting in a teacher-forcing loss weighted by the similarity of the GT caption to the image.
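One reading of this is a token-level cross-entropy under teacher forcing, scaled per sample by an image-caption similarity score (e.g., from CLIP). How exactly the weight enters the loss in the paper may differ; the function below is an illustrative assumption.

```python
# Hedged sketch of a similarity-weighted teacher-forcing loss.
import torch
import torch.nn.functional as F

def weighted_teacher_forcing_loss(logits, gt_tokens, clip_sim):
    """
    logits:    (batch, seq_len, vocab) decoder outputs under teacher forcing
    gt_tokens: (batch, seq_len) ground-truth caption token ids
    clip_sim:  (batch,) similarity of each GT caption to its image
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gt_tokens.reshape(-1),
        reduction="none",
    ).view(gt_tokens.shape).mean(dim=1)  # per-caption token-level CE
    return (clip_sim * ce).mean()        # weight each trajectory by similarity
```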
Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization
To address this gap, our study introduces a novel zero-shot, sketch-based retrieval method for remote sensing images, leveraging multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update.
Cross-modal Retrieval for Knowledge-based Visual Question Answering
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base.
LeanVec: Searching vectors faster by making them fit
In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy.
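The general recipe (linear dimensionality reduction, then search on compact vectors) can be sketched with a plain PCA projection and scalar quantization. This is not the LeanVec algorithm itself; every step below is a generic stand-in.

```python
# Generic "reduce, quantize, search" sketch (not the LeanVec method).
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((2000, 768)).astype(np.float32)   # database vectors
queries = rng.standard_normal((5, 768)).astype(np.float32)

# 1. Linear dimensionality reduction: project onto top-128 principal axes.
mean = db.mean(axis=0)
_, _, vt = np.linalg.svd(db - mean, full_matrices=False)
P = vt[:128].T                         # (768, 128) projection matrix
db_low = (db - mean) @ P
q_low = (queries - mean) @ P

# 2. Scalar-quantize the reduced database to int8 to shrink memory traffic.
scale = np.abs(db_low).max() / 127.0
db_q = np.round(db_low / scale).astype(np.int8)

# 3. Inner-product search against the dequantized low-dimensional vectors.
scores = q_low @ (db_q.astype(np.float32) * scale).T
top5 = np.argsort(-scores, axis=1)[:, :5]  # indices of 5 nearest per query
```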
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs).
TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification
Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence.
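Read loosely, frame-level memories "communicating" resembles self-attention across the temporal axis; the sketch below captures only that generic interpretation, not the actual TMC module from TF-CLIP.

```python
# Generic temporal self-attention over frame features (interpretation only).
import torch
import torch.nn as nn

frames = torch.randn(2, 8, 512)  # (batch, num_frames, feature_dim)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Each frame attends to every other frame in its sequence, mixing in
# temporal context before pooling to a sequence-level representation.
temporal, _ = attn(frames, frames, frames)
clip_repr = temporal.mean(dim=1)  # (batch, feature_dim)
```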
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.