Visual Dialog
54 papers with code • 8 benchmarks • 10 datasets
Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
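The task definition above (image + dialog history + follow-up question → answer) can be sketched as a toy discriminative baseline. This is a minimal illustration, not any paper's method: real models encode the image with a CNN/transformer, but here a hypothetical `answer_question` simply scores candidate answers by word overlap with the caption, history, and question.

```python
def answer_question(image_caption, history, question, candidates):
    """Toy discriminative Visual Dialog baseline (hypothetical).

    image_caption: text stand-in for visual features
    history: list of (question, answer) pairs from earlier rounds
    candidates: candidate answers to rank, as in the VisDial setup
    Returns the candidate with the largest word overlap with the context.
    """
    # Build a bag-of-words context from the image, history, and question.
    context = set(image_caption.lower().split())
    for q, a in history:
        context |= set(q.lower().split()) | set(a.lower().split())
    context |= set(question.lower().split())

    # Score each candidate by overlap with the accumulated context.
    return max(candidates, key=lambda c: len(set(c.lower().split()) & context))
```

Ranking a fixed candidate list (rather than generating free-form text) mirrors the discriminative evaluation protocol used on the VisDial benchmark.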
Libraries
Use these libraries to find Visual Dialog models and implementations.
Most implemented papers
Recursive Visual Attention in Visual Dialog
Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image.
Large-Scale Answerer in Questioner's Mind for Visual Dialog Question Generation
Answerer in Questioner's Mind (AQM) is an information-theoretic framework that has been recently proposed for task-oriented dialog systems.
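The information-theoretic idea behind AQM-style question generation can be illustrated with a small sketch (an assumption for illustration, not the paper's implementation): the questioner keeps a posterior over possible targets and picks the question whose answer is expected to reduce its entropy the most.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def information_gain(prior, likelihood):
    """Expected entropy reduction over targets from asking one question.

    prior[t]        = P(target t)
    likelihood[a][t] = P(answer a | target t, question)  (rows sum to 1 per t)
    """
    h_prior = entropy(prior)
    gain = 0.0
    for lik_a in likelihood:                       # iterate possible answers
        p_a = sum(l * p for l, p in zip(lik_a, prior))
        if p_a == 0:
            continue
        posterior = [l * p / p_a for l, p in zip(lik_a, prior)]
        gain += p_a * (h_prior - entropy(posterior))
    return gain

def select_question(prior, questions):
    """Greedily pick the question with the largest expected information gain."""
    return max(questions, key=lambda q: information_gain(prior, questions[q]))
```

With a uniform prior over two targets, a question whose answer perfectly separates them yields a gain of ln 2, while an uninformative question yields zero, so the greedy rule prefers the former.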
Discourse Parsing in Videos: A Multi-modal Approach
We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video.
CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset.
Factor Graph Attention
We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities.
Reasoning Visual Dialogs with Structural and Partial Observations
The answer to a given question is represented by a node with missing value.
Improving Generative Visual Dialog by Answering Diverse Questions
Prior work on training generative Visual Dialog models with reinforcement learning (Das et al.) has explored a Qbot-Abot image-guessing game and shown that this 'self-talk' approach can lead to improved performance on the downstream dialog-conditioned image-guessing task.
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
Despite impressive recent progress that has been reported on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.
DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue
More importantly, by visualizing the gate values we can tell which modality (visual or semantic) contributes more to answering the current question.
An Annotated Corpus of Reference Resolution for Interpreting Common Grounding
Common grounding is the process of creating, repairing and updating mutual understandings, which is a fundamental aspect of natural language conversation.