VD-Ref is a dataset with ground-truth mappings from both noun phrases and pronouns to image regions. It contains 10k complete sets drawn from the VisDialog dataset, with sentences tokenized using the Stanford CoreNLP tool to make them suitable for the subsequent human annotation.
Source: Extending Phrase Grounding with Pronouns in Visual Dialogues