Visual Dialog (VisDial) dataset contains human annotated questions based on images of MS COCO dataset. This dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the job of a ‘questioner’ and the other person acted as an ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from MS COCO dataset) and the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image, caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for 10 rounds at max.
VisDial v1.0 contains 123K dialogues on MS COCO (2017 training set) for training split, 2K dialogues with validation images for validation split and 8K dialogues on test set for test-standard set. The previously released v0.5 and v0.9 versions of VisDial dataset (corresponding to older splits of MS COCO) are considered deprecated.Source: Granular Multimodal Attention Networks for Visual Dialog