Visual Dialog
54 papers with code • 8 benchmarks • 10 datasets
Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
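The input/output shape of the task described above can be sketched as a minimal data structure (an illustrative sketch only; the class and function names here are hypothetical and not tied to any particular dataset or model):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class VisualDialogInstance:
    """One Visual Dialog example: an image, the dialog so far,
    and a follow-up question the agent must answer."""
    image_id: str                                  # reference to the image under discussion
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) turns
    question: str = ""                             # current follow-up question

def answer_question(instance: VisualDialogInstance,
                    model: Callable[[str, List[Tuple[str, str]], str], str]) -> str:
    """A model consumes the image reference, the full dialog history,
    and the current question, and returns a free-form answer string."""
    return model(instance.image_id, instance.history, instance.question)
```

A trivial stand-in model (e.g. a function that ignores the image and echoes a canned answer) can be plugged in to exercise this interface before wiring up a real vision-language model.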
Latest papers with no code
Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
We also introduce DisNet, a novel dataset containing the proposed visual discourse annotations of 3000 videos and their paragraphs.
How to Fool Systems and Humans in Visually Grounded Interaction: A Case Study on Adversarial Attacks on Visual Dialog
Adversarial attacks change predictions of deep neural network models, while aiming to remain unnoticed by the user. This is a challenge for textual attacks, which target discrete text.
ViDA-MAN: Visual Dialog with Digital Humans
We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers real-time audio-visual responses to instant speech inquiries.
Evaluating and Improving Interactions with Hazy Oracles
Many AI systems integrate sensor inputs, world knowledge, and human-provided information to perform inference.
Variational Disentangled Attention for Regularized Visual Dialog
One of the most important challenges in visual dialog is to effectively extract the information from a given image and its conversation history that is relevant to the current question.
GoG: Relation-aware Graph-over-Graph Network for Visual Dialog
Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations within the dialog history; 2) History-aware Q-Graph, which aims to fully understand the question by capturing dependency relations between words based on coreference resolution over the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image based on the full question representation.
Learning to Ground Visual Objects for Visual Dialog
Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and it ensures the appropriate grounding of visual objects during the training process.
Visual-Textual Alignment for Graph Inference in Visual Dialog
As a conversational intelligence task, visual dialog entails answering a series of questions grounded in an image, using the dialog history as context.
Reasoning Over History: Context Aware Visual Dialog
While neural models have been shown to exhibit strong performance on single-turn visual question answering (VQA) tasks, extending VQA to a multi-turn, conversational setting remains a challenge.
Multi-Modal Open-Domain Dialogue
Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020).