It includes two tasks: (1) Image as Query and Text as Targets; (2) Text as Query and Image as Targets.
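Both directions reduce to the same operation once images and texts live in a shared embedding space: encode the query, then rank all candidates from the other modality by similarity. A minimal sketch follows; the `retrieve` helper, the toy 2-D vectors, and the cosine-similarity ranking are illustrative assumptions, not the method of any specific paper listed here.

```python
import numpy as np

def retrieve(query_emb, target_embs, k=2):
    """Rank cross-modal targets by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)                       # normalize query
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)  # normalize targets
    scores = t @ q                                                  # cosine similarities
    return np.argsort(-scores)[:k]                                  # top-k target indices

# Toy joint-embedding space: one image query, three candidate captions.
image_query = np.array([1.0, 0.0])
caption_embs = np.array([[0.9, 0.1],
                         [0.0, 1.0],
                         [0.7, 0.7]])
print(retrieve(image_query, caption_embs))
```

Swapping the roles of the arguments (text embedding as `query_emb`, image embeddings as `target_embs`) gives the second task direction with no other changes.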
Large-scale pre-training methods that learn cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
SOTA for Visual Question Answering on VQA v2
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
SOTA for Text-Image Retrieval on COCO
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner.
#2 best model for Text-Image Retrieval on COCO (image as query)