Cross-Modal Retrieval

190 papers with code • 12 benchmarks • 20 datasets

Cross-Modal Retrieval is the task of retrieving relevant items across different modalities, such as image-text, video-text, and audio-text retrieval. Its main challenge is the modality gap: features produced by different modalities are not directly comparable. The key solution is to map each modality into a shared subspace, so that the resulting representations can be compared with standard distance metrics such as cosine distance or Euclidean distance.
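
To make the shared-subspace idea concrete, here is a minimal sketch: hypothetical pre-extracted image and text features are projected into a common space, L2-normalized, and ranked by cosine similarity. The encoders, dimensions, and projection heads are illustrative placeholders, not any particular published model.

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-extracted features (stand-ins for the outputs of
# pretrained image and text encoders); all dimensions are placeholders.
image_feats = torch.randn(100, 2048)   # 100 images, 2048-d visual features
text_feats = torch.randn(100, 768)     # 100 captions, 768-d text features

# Learnable projections into a shared 256-d subspace, one per modality.
img_proj = torch.nn.Linear(2048, 256)
txt_proj = torch.nn.Linear(768, 256)

# L2-normalize so the dot product equals cosine similarity.
img_emb = F.normalize(img_proj(image_feats), dim=-1)
txt_emb = F.normalize(txt_proj(text_feats), dim=-1)

# Text-to-image retrieval: rank images by cosine similarity to each caption.
sim = txt_emb @ img_emb.T              # (100, 100) similarity matrix
top5 = sim.topk(5, dim=-1).indices     # 5 nearest images per caption
```

In practice the projection heads are trained with a contrastive or triplet objective so that matching pairs end up close together in the shared space.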

Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

magic-ai4med/kep 15 Apr 2024

In this paper, we consider the problem of visual representation learning for computational pathology by exploiting large-scale image-text pairs gathered from public resources, along with the domain-specific knowledge in pathology.

Bridging Vision and Language Spaces with Assignment Prediction

park-jungin/vlap 15 Apr 2024

This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world.

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

hhc1997/l2rm 8 Mar 2024

To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs.
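
The paper's learning-to-rematch objective is more involved, but a generic entropy-regularized OT (Sinkhorn) sketch shows how a soft rematching plan can be computed from cross-modal similarities. The cost definition, uniform marginals, and hyperparameters below are illustrative assumptions, not L2RM's actual formulation.

```python
import torch

def sinkhorn(cost, eps=0.05, n_iters=50):
    """Entropy-regularized OT: soft assignment plan for a cost matrix.
    Generic Sinkhorn iterations, not the L2RM training objective itself."""
    m, n = cost.shape
    K = torch.exp(-cost / eps)             # Gibbs kernel
    a = torch.full((m,), 1.0 / m)          # uniform row marginals (assumed)
    b = torch.full((n,), 1.0 / n)          # uniform column marginals (assumed)
    v = torch.ones(n)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]     # transport plan

# Low cost = likely correct image-text pairing.
sim = torch.rand(32, 32)                   # e.g., image-text cosine similarities
plan = sinkhorn(1.0 - sim)                 # soft rematching of a mismatched batch
```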

Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

nohtow/wtf-rl 21 Feb 2024

Secondly, the ground-truth captions can serve as additional trajectories in the RL strategy, resulting in a teacher-forcing loss weighted by the similarity of the GT caption to the image.
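
A sketch of such a similarity-weighted teacher-forcing loss follows; the shapes, the CLIP similarity input, and the exact weighting are placeholder assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

# Placeholder decoder outputs under teacher forcing on the GT caption.
logits = torch.randn(8, 20, 30522)        # (batch, seq_len, vocab)
gt_tokens = torch.randint(0, 30522, (8, 20))
clip_sim = torch.rand(8)                  # CLIP similarity of each GT caption to its image

# Per-sample teacher-forcing (cross-entropy) loss ...
tf_loss = F.cross_entropy(
    logits.flatten(0, 1), gt_tokens.flatten(), reduction="none"
).view(8, 20).mean(dim=1)

# ... down-weighted when the GT caption matches the image poorly.
loss = (clip_sim * tf_loss).mean()
```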

Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization

snowstormfly/cross-modal-retrieval-mlagt 3 Feb 2024

To address this gap, our study introduces a novel zero-shot, sketch-based retrieval method for remote sensing images, leveraging multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update.

Cross-modal Retrieval for Knowledge-based Visual Question Answering

paullerner/viquae 11 Jan 2024

Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base.

LeanVec: Searching vectors faster by making them fit

intellabs/vectorsearchdatasets 26 Dec 2023

In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy.
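
LeanVec's actual pipeline is more sophisticated, but the recipe of a linear projection followed by quantization before search can be sketched as below; the PCA projection, int8 scalar quantizer, and all sizes are illustrative assumptions.

```python
import numpy as np

# Toy database and queries (placeholders for real high-dimensional vectors).
rng = np.random.default_rng(0)
db = rng.normal(size=(5000, 256)).astype(np.float32)
queries = rng.normal(size=(10, 256)).astype(np.float32)

# 1) Linear dimensionality reduction: project onto top principal directions.
mean = db.mean(axis=0)
_, _, vt = np.linalg.svd(db - mean, full_matrices=False)
P = vt[:64].T                                # 256 -> 64 projection matrix
db_lo = (db - mean) @ P
q_lo = (queries - mean) @ P

# 2) Crude scalar quantization of the reduced database vectors to int8.
scale = np.abs(db_lo).max() / 127.0
db_q = np.round(db_lo / scale).astype(np.int8)

# 3) Approximate search: inner products in the reduced, quantized space.
scores = q_lo @ (db_q.astype(np.float32) * scale).T
top10 = np.argsort(-scores, axis=1)[:, :10]  # candidate ids per query
```

A real system would typically re-rank these candidates with the original full-precision vectors to recover accuracy.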

SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

wangzhecheng/skyscript 20 Dec 2023

Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs).

TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification

asuradayuci/tf-clip 15 Dec 2023

Technically, the Temporal Memory Construction (TMC) module allows the frame-level memories in a sequence to communicate with each other and to extract temporal information based on the relations within the sequence.
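
A generic sketch of frame-level memories communicating through self-attention; the shapes and the attention module are assumptions for illustration, not the paper's TMC implementation.

```python
import torch

frames, dim = 8, 512
memories = torch.randn(1, frames, dim)           # one sequence of frame-level memories
attn = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
updated, _ = attn(memories, memories, memories)  # frames attend to one another
clip_feat = updated.mean(dim=1)                  # temporal aggregation into one feature
```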

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

aimagelab/safe-clip 27 Nov 2023

We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator.
