It includes two tasks: (1) Image as Query and Text as Targets; (2) Text as Query and Image as Targets.
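Both task directions reduce to nearest-neighbor search in a joint embedding space. A minimal sketch, assuming hypothetical pre-computed, L2-normalised image and caption embeddings (the random vectors below stand in for a real encoder's output):

```python
import numpy as np

# Hypothetical embeddings in a shared space: 4 images and 4 captions,
# 8-dimensional, L2-normalised so dot product equals cosine similarity.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(4, 8))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

def retrieve(query, targets, k=2):
    """Rank targets by cosine similarity to the query (both unit-norm)."""
    sims = targets @ query            # cosine similarity via dot product
    return np.argsort(-sims)[:k]      # indices of the top-k targets

# (1) Image as query, text as targets:
top_captions = retrieve(image_emb[0], text_emb)
# (2) Text as query, image as targets:
top_images = retrieve(text_emb[0], image_emb)
```

The same `retrieve` function serves both directions; only the roles of the two embedding matrices are swapped.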
Large-scale pre-training methods that learn cross-modal representations from image-text pairs are becoming popular for vision-language tasks.
Ranked #1 on Text-Image Retrieval on COCO (image as query)
In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space.
Ranked #1 on Cross-Modal Retrieval on Flickr30k
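A common way to learn such a shared visual-textual space (a sketch of a symmetric InfoNCE-style contrastive loss, not necessarily the exact objective used by the paper above) is to treat matched image-text pairs in a batch as positives and all other pairings as negatives:

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of matched pairs.
    Row i of each matrix is assumed to be a positive (image, text)
    pair; every other row in the batch serves as a negative."""
    sims = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Image-to-text direction: each image should pick out its caption.
    i2t = -np.diag(log_softmax(sims, axis=1)).mean()
    # Text-to-image direction: each caption should pick out its image.
    t2i = -np.diag(log_softmax(sims, axis=0)).mean()
    return (i2t + t2i) / 2
```

Minimising this loss pulls matched pairs together and pushes mismatched pairs apart, which is what makes similarity-based retrieval in the shared space work.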
Our approach is based on a deep architecture that approximates the sorting of arbitrary sets of scores.
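One simple way to approximate sorting differentiably (a pairwise-sigmoid soft rank, offered as an illustrative relaxation rather than the specific architecture of the paper above) replaces each hard rank comparison with a sigmoid:

```python
import numpy as np

def soft_rank(scores, tau=0.01):
    """Smooth approximation of each score's descending rank:
    rank_i ~= 1 + sum_{j != i} sigmoid((s_j - s_i) / tau).
    As tau -> 0 this approaches the exact ranks while remaining
    differentiable in the scores."""
    diff = scores[None, :] - scores[:, None]          # s_j - s_i
    sig = 1.0 / (1.0 + np.exp(-diff / tau))           # soft comparisons
    # The j == i term contributes sigmoid(0) = 0.5, so subtract it.
    return 1.0 + sig.sum(axis=1) - 0.5
```

With a small temperature `tau` the output rounds to the exact ranks; larger values trade accuracy for smoother gradients.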
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
Ranked #1 on Text-Image Retrieval on COCO