Cross-modal retrieval is the task of retrieving items in one modality using a query from another, such as image-text, video-text, and audio-text retrieval. The main challenge of cross-modal retrieval is the modality gap, and the key solution is to map the different modalities into a shared subspace, so that the newly generated representations can be compared with standard distance metrics such as cosine distance and Euclidean distance.
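The shared-subspace idea above can be sketched in a few lines: once images and texts are embedded into the same space, retrieval reduces to ranking by cosine similarity. This is a minimal numpy illustration with hand-made toy embeddings (the `cosine_sim` helper and the example vectors are illustrative, not from any specific paper).

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy shared-space embeddings: 3 images and their 3 paired captions (2-D for clarity).
image_emb = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])
text_emb = np.array([[0.9, 0.1],   # caption 0, close to image 0
                     [0.1, 0.9],   # caption 1, close to image 1
                     [1.0, 0.8]])  # caption 2, close to image 2

sims = cosine_sim(text_emb, image_emb)  # (3 captions) x (3 images)
ranked = np.argsort(-sims, axis=1)      # best-matching image first, per caption
print(ranked[:, 0])                     # -> [0 1 2]: each caption retrieves its image
```

In practice the embeddings come from learned encoders (e.g. a CNN for images and an RNN or transformer for text) trained so that matched pairs land close together; the ranking step stays exactly this simple.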
Prior work either simply aggregates the similarities of all possible region-word pairs without attending differentially to more and less important words or regions, or uses a multi-step attentional process that captures only a limited number of semantic alignments and is less interpretable.
Ranked #3 on Cross-Modal Retrieval on COCO 2014
In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space.
Ranked #5 on Text based Person Retrieval on CUHK-PEDES
It outperforms the current best method by 6.8% relatively for image retrieval and 4.8% relatively for caption retrieval on MS-COCO (R@1 using the 1K test set).
Ranked #2 on Cross-Modal Retrieval on COCO 2014
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
Ranked #4 on Cross-Modal Retrieval on COCO 2014
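The entry above combines a global context vector with locally-guided features via multi-head self-attention and residual learning. As a rough, hedged sketch of that general idea (a minimal numpy toy, not the PIE-Net implementation; the function name `polysemous_embeddings` and the single attention matrix `W_att` are simplifications of my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def polysemous_embeddings(local_feats, global_feat, W_att):
    # local_feats: (L, D) region/word features; global_feat: (D,) context vector.
    # W_att: (K, D), one attention query per head -> K diverse embeddings.
    scores = softmax(W_att @ local_feats.T, axis=1)  # (K, L) attention weights
    attended = scores @ local_feats                  # (K, D) locally-guided features
    return global_feat[None, :] + attended           # residual combination, (K, D)

rng = np.random.default_rng(0)
L, D, K = 5, 8, 3                       # 5 local features, dim 8, 3 heads
local_feats = rng.normal(size=(L, D))
global_feat = local_feats.mean(axis=0)  # simple global context
W_att = rng.normal(size=(K, D))

emb = polysemous_embeddings(local_feats, global_feat, W_att)
print(emb.shape)  # (3, 8): K diverse embeddings for one instance
```

Because each head attends to different local features, the K output rows give multiple distinct representations of the same instance, which is what lets the model handle polysemous images or sentences.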
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
In this paper, we address text-image matching in cross-modal retrieval for the fashion industry.
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and a healthy lifestyle.
Up to now, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video or vice versa.