Cross-Modal Retrieval
192 papers with code • 13 benchmarks • 21 datasets
Cross-Modal Retrieval is the task of retrieving items across different modalities, such as image-text, video-text, and audio-text retrieval. Its main challenge is the modality gap, and the key solution is to map the different modalities into a shared subspace, so that the resulting representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
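The shared-subspace idea above can be sketched as follows. The two encoders here are random stand-ins (`encode_image` and `encode_text` are hypothetical names, not from any specific library); in practice they would be trained models, e.g. the image and text towers of CLIP, mapping both modalities into the same d-dimensional space so cosine similarity between them is meaningful.

```python
import numpy as np

# Hypothetical stand-in encoders: in a real system these would be trained
# networks (e.g. CLIP's image and text towers) projecting each modality
# into the same shared d-dimensional subspace.
def encode_image(images, d=4):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(images), d))

def encode_text(texts, d=4):
    rng = np.random.default_rng(1)
    return rng.normal(size=(len(texts), d))

def cosine_retrieve(query_vecs, gallery_vecs):
    """Rank gallery items for each query by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sim = q @ g.T                    # (num_queries, num_gallery) similarities
    return np.argsort(-sim, axis=1)  # best match first per query

# Text-to-image retrieval: rank images for each text query.
texts = ["a dog on the beach", "a cat on a sofa", "a red car"]
images = ["img0.jpg", "img1.jpg", "img2.jpg"]
ranking = cosine_retrieve(encode_text(texts), encode_image(images))
```

With trained encoders, the top-ranked gallery index per row would be the best cross-modal match; the same function works unchanged for image-to-text or audio-text retrieval.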
Latest papers
TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification
Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence.
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training.
Weakly supervised cross-modal learning in high-content screening
With the surge in available data from various modalities, there is a growing need to bridge the gap between different data types.
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
We propose a metadata-aware self-supervised learning (SSL) framework useful for fine-grained classification and ecological mapping of bird species around the world.
A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval
Our highlight is the proposal of a paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations.
InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution
However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hinders retrieval performance due to the inseparability of these representations.
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
In this work, we present a post-processing solution to address the hubness problem in cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance.
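A minimal sketch of bank-based hubness correction, in the spirit of this line of work (this is an illustrative inverted-softmax-style normalization, not the paper's exact method; `debias_hubness` and its `beta` parameter are assumptions). Gallery items that score highly against many held-out bank queries are hubs, so each gallery column is down-weighted by its total activation over the bank:

```python
import numpy as np

def debias_hubness(sim, bank_sim, beta=20.0):
    """Down-weight hub gallery items using a query bank.

    sim:      (num_queries, num_gallery) query-to-gallery similarities
    bank_sim: (num_bank, num_gallery) similarities from a bank of held-out
              queries to the same gallery items
    Each gallery column is normalized by its summed exponential activation
    over the bank, so items retrieved by many bank queries (hubs) are demoted.
    """
    hub_activation = np.exp(beta * bank_sim).sum(axis=0, keepdims=True)
    return np.exp(beta * sim) / hub_activation

# Gallery item 0 is a hub: it scores highly against every bank query.
sim = np.array([[0.9, 0.8]])
bank_sim = np.array([[0.9, 0.1],
                     [0.9, 0.2],
                     [0.9, 0.0]])
debiased = debias_hubness(sim, bank_sim)
```

Before correction the query retrieves the hub (item 0); after correction, item 1, which few bank queries activate, ranks first.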
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs
Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks.