Cross-Modal Retrieval
192 papers with code • 13 benchmarks • 21 datasets
Cross-Modal Retrieval is the task of retrieving items across different modalities, such as image-text, video-text, and audio-text retrieval. Its main challenge is the modality gap, and the key solution is to map the different modalities into a shared subspace, so that the resulting representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
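The shared-subspace idea above can be sketched as follows. The two encoders here are random stand-ins (`encode_image` and `encode_text` are hypothetical names, not from any specific library); in practice they would be trained models, e.g. the image and text towers of CLIP, mapping both modalities into the same d-dimensional space so cosine similarity between them is meaningful.

```python
import numpy as np

# Hypothetical stand-in encoders: in a real system these would be trained
# networks (e.g. CLIP's image and text towers) projecting each modality
# into the same shared d-dimensional subspace.
def encode_image(images, d=4):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(images), d))

def encode_text(texts, d=4):
    rng = np.random.default_rng(1)
    return rng.normal(size=(len(texts), d))

def cosine_retrieve(query_vecs, gallery_vecs):
    """Rank gallery items for each query by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sim = q @ g.T                    # (num_queries, num_gallery) similarities
    return np.argsort(-sim, axis=1)  # best match first per query

# Text-to-image retrieval: rank images for each text query.
texts = ["a dog on the beach", "a cat on a sofa", "a red car"]
images = ["img0.jpg", "img1.jpg", "img2.jpg"]
ranking = cosine_retrieve(encode_text(texts), encode_image(images))
```

With trained encoders, the top-ranked gallery index per row would be the best cross-modal match; the same function works unchanged for image-to-text or audio-text retrieval.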
Latest papers
TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification
Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence.
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search
Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training.
Weakly supervised cross-modal learning in high-content screening
With the surge in available data from various modalities, there is a growing need to bridge the gap between different data types.
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
We propose a metadata-aware self-supervised learning (SSL) framework useful for fine-grained classification and ecological mapping of bird species around the world.
A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval
Our highlight is the proposal of a paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations.
InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution
However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hinders retrieval performance due to the inseparability of these representations.
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
In this work, we present a post-processing solution to address the hubness problem in cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance.
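A minimal sketch of bank-based hubness correction, in the spirit of this line of work (this is an illustrative inverted-softmax-style normalization, not the paper's exact method; `debias_hubness` and its `beta` parameter are assumptions). Gallery items that score highly against many held-out bank queries are hubs, so each gallery column is down-weighted by its total activation over the bank:

```python
import numpy as np

def debias_hubness(sim, bank_sim, beta=20.0):
    """Down-weight hub gallery items using a query bank.

    sim:      (num_queries, num_gallery) query-to-gallery similarities
    bank_sim: (num_bank, num_gallery) similarities from a bank of held-out
              queries to the same gallery items
    Each gallery column is normalized by its summed exponential activation
    over the bank, so items retrieved by many bank queries (hubs) are demoted.
    """
    hub_activation = np.exp(beta * bank_sim).sum(axis=0, keepdims=True)
    return np.exp(beta * sim) / hub_activation

# Gallery item 0 is a hub: it scores highly against every bank query.
sim = np.array([[0.9, 0.8]])
bank_sim = np.array([[0.9, 0.1],
                     [0.9, 0.2],
                     [0.9, 0.0]])
debiased = debias_hubness(sim, bank_sim)
```

Before correction the query retrieves the hub (item 0); after correction, item 1, which few bank queries activate, ranks first.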
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs
Foundation models (FMs) are able to leverage large volumes of unlabeled data to demonstrate superior performance across a wide range of tasks.