Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the problem of accurate cross-media retrieval through image-sentence matching based on word-region alignments using supervision only at the global image-sentence level...
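To make the idea of word-region alignment with only global supervision concrete, here is a minimal sketch of how a matrix of region-word similarities could be pooled into a single image-sentence score. The pooling rule (best region per word, summed over words), the cosine similarity, and the function names are illustrative assumptions; the excerpt above does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_matrix(regions, words):
    """Similarity of every image region (rows) to every word (columns)."""
    return [[cosine(r, w) for w in words] for r in regions]


def global_score(regions, words):
    """Pool the fine-grained alignments into one image-sentence score.

    Assumed pooling: each word keeps its best-matching region,
    and the per-word maxima are summed.
    """
    sim = alignment_matrix(regions, words)
    return sum(
        max(sim[i][j] for i in range(len(regions)))
        for j in range(len(words))
    )


# Toy features: two regions and two words in a 2-D embedding space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
score = global_score(regions, words)
```

Because only the pooled score would enter a training loss, the individual region-word alignments are never supervised directly; they are learned as a by-product of matching whole images to whole sentences.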


Results from the Paper


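| Task | Dataset | Model | Metric | Value | Global rank |
| Cross-Modal Retrieval | COCO 2014 | TERAN MrSw | Image-to-text R@1 | 55.6 | # 2 |
| Cross-Modal Retrieval | COCO 2014 | TERAN MrSw | Image-to-text R@5 | 83.9 | # 2 |
| Cross-Modal Retrieval | COCO 2014 | TERAN MrSw | Image-to-text R@10 | 91.6 | # 2 |
| Image Retrieval | Flickr30K 1K test | TERAN Symm. | R@1 | 55.7 | # 2 |
| Image Retrieval | Flickr30K 1K test | TERAN Symm. | R@10 | 89.3 | # 1 |
| Image Retrieval | Flickr30K 1K test | TERAN Symm. | R@5 | 83.1 | # 1 |
| Image Retrieval | Flickr30K 1K test | TERAN MrSw | R@1 | 56.5 | # 1 |
| Image Retrieval | Flickr30K 1K test | TERAN MrSw | R@10 | 88.2 | # 2 |
| Image Retrieval | Flickr30K 1K test | TERAN MrSw | R@5 | 81.2 | # 3 |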
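
Retrieval benchmarks of this kind are usually scored with recall at K (R@K): the percentage of queries for which at least one correct item appears among the top K retrieved results. The sketch below assumes that standard definition; the function name and the toy data are hypothetical and are not taken from the paper.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate ids per query (best first).
    relevant: one set of correct ids per query, in the same order.
    """
    hits = 0
    for ranking, correct in zip(rankings, relevant):
        if any(item in correct for item in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(rankings)


# Toy example: three queries, each with "a" as the only correct item.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
relevant = [{"a"}, {"a"}, {"a"}]

r1 = recall_at_k(rankings, relevant, 1)  # only the first query hits
r2 = recall_at_k(rankings, relevant, 2)  # first and second queries hit
r3 = recall_at_k(rankings, relevant, 3)  # every query hits
```

Under this definition R@K can only stay equal or grow as K increases, which is why the values reported for larger K are never below those for smaller K.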

Methods used in the Paper