Cross-Modal Retrieval
190 papers with code • 12 benchmarks • 20 datasets
Cross-Modal Retrieval implements a retrieval task across different modalities, such as image-text, video-text, and audio-text retrieval. The main challenge is the modality gap; the key solution is to learn representations for the different modalities in a shared subspace, so that the resulting features can be compared with standard distance metrics such as cosine distance and Euclidean distance.
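As a concrete illustration of the shared-subspace idea, the sketch below projects precomputed image and text features into one space and ranks candidates by cosine similarity. The encoders, dimensions, and names are illustrative assumptions, not taken from any particular paper.

```python
# Minimal shared-subspace sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedder(nn.Module):
    """Projects image and text features into one shared space."""
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so that dot products equal cosine similarity.
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

model = TwoBranchEmbedder()
img_feats = torch.randn(4, 2048)  # e.g., CNN features for 4 images
txt_feats = torch.randn(6, 768)   # e.g., text-encoder features for 6 captions
z_img, z_txt = model(img_feats, txt_feats)

# Entry (i, j) scores image i against caption j; retrieval ranks
# captions for each image (or images for each caption) by this score.
sim = z_img @ z_txt.t()
print(sim.shape)  # torch.Size([4, 6])
```

In practice the two projections are trained jointly (e.g., with a contrastive or triplet loss) so that matching pairs land close together in the shared space.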
Latest papers
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
In this paper, we consider the problem of visual representation learning for computational pathology, by exploiting large-scale image-text pairs gathered from public resources, along with domain-specific knowledge in pathology.
Bridging Vision and Language Spaces with Assignment Prediction
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.
Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval
To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs.
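To make the OT idea concrete, here is a hedged sketch of rematching with generic entropic Optimal Transport (Sinkhorn iterations) over a cross-modal cost matrix. This is not the L2RM implementation; the function name, uniform marginals, and hyperparameters are all assumptions.

```python
# Generic Sinkhorn rematching sketch (not the L2RM code).
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-OT transport plan for a cost matrix with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)          # Gibbs kernel
    a = torch.full((n,), 1.0 / n)       # uniform row marginal
    b = torch.full((m,), 1.0 / m)       # uniform column marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]  # transport plan

# Cost = 1 - cosine similarity between (possibly mismatched) embeddings.
z_img = F.normalize(torch.randn(8, 256), dim=-1)
z_txt = F.normalize(torch.randn(8, 256), dim=-1)
plan = sinkhorn_plan(1.0 - z_img @ z_txt.t())

# High-mass entries of the plan suggest better image-text pairings than
# the given, possibly noisy, correspondence.
rematch = plan.argmax(dim=1)
```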
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Ground-truth captions can also serve as additional trajectories in the RL strategy, resulting in a teacher-forcing loss weighted by the similarity of the GT caption to the image.
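One reading of this is a token-level cross-entropy under teacher forcing, scaled per sample by an image-caption similarity score (e.g., from CLIP). How exactly the weight enters the loss in the paper may differ; the function below is an illustrative assumption.

```python
# Hedged sketch of a similarity-weighted teacher-forcing loss.
import torch
import torch.nn.functional as F

def weighted_teacher_forcing_loss(logits, gt_tokens, clip_sim):
    """
    logits:    (batch, seq_len, vocab) decoder outputs under teacher forcing
    gt_tokens: (batch, seq_len) ground-truth caption token ids
    clip_sim:  (batch,) similarity of each GT caption to its image
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gt_tokens.reshape(-1),
        reduction="none",
    ).view(gt_tokens.shape).mean(dim=1)  # per-caption token-level CE
    return (clip_sim * ce).mean()        # weight each trajectory by similarity
```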
Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization
To address this gap, our study introduces a novel zero-shot, sketch-based retrieval method for remote sensing images, leveraging multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update.
Cross-modal Retrieval for Knowledge-based Visual Question Answering
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base.
LeanVec: Searching vectors faster by making them fit
In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy.
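The general recipe (linear dimensionality reduction, then search on compact vectors) can be sketched with a plain PCA projection and scalar quantization. This is not the LeanVec algorithm itself; every step below is a generic stand-in.

```python
# Generic "reduce, quantize, search" sketch (not the LeanVec method).
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((2000, 768)).astype(np.float32)   # database vectors
queries = rng.standard_normal((5, 768)).astype(np.float32)

# 1. Linear dimensionality reduction: project onto top-128 principal axes.
mean = db.mean(axis=0)
_, _, vt = np.linalg.svd(db - mean, full_matrices=False)
P = vt[:128].T                         # (768, 128) projection matrix
db_low = (db - mean) @ P
q_low = (queries - mean) @ P

# 2. Scalar-quantize the reduced database to int8 to shrink memory traffic.
scale = np.abs(db_low).max() / 127.0
db_q = np.round(db_low / scale).astype(np.int8)

# 3. Inner-product search against the dequantized low-dimensional vectors.
scores = q_low @ (db_q.astype(np.float32) * scale).T
top5 = np.argsort(-scores, axis=1)[:, :5]  # indices of 5 nearest per query
```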
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs).
TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification
Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence.
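Read loosely, frame-level memories "communicating" resembles self-attention across the temporal axis; the sketch below captures only that generic interpretation, not the actual TMC module from TF-CLIP.

```python
# Generic temporal self-attention over frame features (interpretation only).
import torch
import torch.nn as nn

frames = torch.randn(2, 8, 512)  # (batch, num_frames, feature_dim)
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Each frame attends to every other frame in its sequence, mixing in
# temporal context before pooling to a sequence-level representation.
temporal, _ = attn(frames, frames, frames)
clip_repr = temporal.mean(dim=1)  # (batch, feature_dim)
```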
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.