Visual Dialog

54 papers with code • 8 benchmarks • 10 datasets

Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
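
As a minimal illustration of the task's inputs and outputs, the sketch below defines a toy data structure and model interface in Python; all names (e.g. `VisualDialogExample`, `answer_question`) are hypothetical and for illustration only, not from any particular implementation.

```python
# Minimal sketch of the Visual Dialog task interface (illustrative only;
# names such as `answer_question` are hypothetical, not from any paper).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VisualDialogExample:
    image_path: str                                                # the grounding image
    history: List[Tuple[str, str]] = field(default_factory=list)  # prior (question, answer) turns
    question: str = ""                                             # follow-up question to answer

def answer_question(example: VisualDialogExample) -> str:
    """A model would fuse image features, the dialog history, and the
    current question to produce a free-form or ranked answer."""
    raise NotImplementedError

example = VisualDialogExample(
    image_path="kitchen.jpg",
    history=[("What is on the table?", "A bowl of fruit.")],
    question="What color is the bowl?",
)
```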

Latest papers with no code

Discourse Analysis for Evaluating Coherence in Video Paragraph Captions

no code yet • 17 Jan 2022

We also introduce DisNet, a novel dataset containing the proposed visual discourse annotations of 3000 videos and their paragraphs.

How to Fool Systems and Humans in Visually Grounded Interaction: A Case Study on Adversarial Attacks on Visual Dialog

no code yet • ACL ARR January 2022

Adversarial attacks aim to change the predictions of deep neural network models while remaining unnoticed by the user. This is challenging for textual attacks, which must operate on discrete text.

ViDA-MAN: Visual Dialog with Digital Humans

no code yet • 26 Oct 2021

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction that offers real-time audio-visual responses to instant speech inquiries.

Evaluating and Improving Interactions with Hazy Oracles

no code yet • 19 Oct 2021

Many AI systems integrate sensor inputs, world knowledge, and human-provided information to perform inference.

Variational Disentangled Attention for Regularized Visual Dialog

no code yet • 29 Sep 2021

One of the most important challenges in visual dialog is to effectively extract the information from a given image and its conversation history that is relevant to the current question.

GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

no code yet • Findings (ACL) 2021

Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history turns; 2) History-aware Q-Graph, which aims to fully understand the question by capturing dependency relations between words, based on coreference resolution over the dialog history; and 3) Question-aware I-Graph, which aims to capture the relations between objects in an image based on the full question representation.
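
As a rough, hedged sketch of this kind of three-stage design, the PyTorch snippet below chains three generic graph-attention layers over history, question, and image objects; the layer internals, feature shapes, and conditioning scheme are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of a GoG-style pipeline with generic graph-attention
# layers; not the paper's actual model.
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """Placeholder relation-aware graph layer: attention over node pairs."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, nodes):
        out, _ = self.attn(nodes, nodes, nodes)  # fully-connected message passing
        return out + nodes                        # residual update of node features

class GoGSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.h_graph = GraphLayer(dim)  # 1) coreference relations over history turns
        self.q_graph = GraphLayer(dim)  # 2) history-aware dependencies between question words
        self.i_graph = GraphLayer(dim)  # 3) question-aware relations between image objects

    def forward(self, history, question, objects):
        # history: (B, T, dim), question: (B, L, dim), objects: (B, N, dim)
        h = self.h_graph(history)
        q = self.q_graph(question + h.mean(1, keepdim=True))  # condition question on history
        v = self.i_graph(objects + q.mean(1, keepdim=True))   # condition objects on question
        return v
```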

Learning to Ground Visual Objects for Visual Dialog

no code yet • Findings (EMNLP) 2021

Specifically, a posterior distribution over visual objects is inferred from both the context (history and questions) and the answers, which ensures appropriate grounding of visual objects during training.
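
The snippet below sketches the generic variational pattern this describes: a posterior over image objects conditioned on context and answer, a prior conditioned on context alone, and a KL term pulling them together so that context-only grounding still works at inference time. Everything here (layer choices, shapes, the KL direction) is an assumption for illustration, not the paper's exact formulation.

```python
# Hedged sketch of variational object grounding; details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectGrounding(nn.Module):
    def __init__(self, dim=512, n_objects=36):
        super().__init__()
        self.prior_net = nn.Linear(dim, n_objects)           # p(z | context)
        self.posterior_net = nn.Linear(2 * dim, n_objects)   # q(z | context, answer)

    def forward(self, context, answer):
        # context, answer: (B, dim) pooled representations
        prior_logits = self.prior_net(context)
        post_logits = self.posterior_net(torch.cat([context, answer], dim=-1))
        # KL(q || p) pulls the context-only prior toward the answer-informed
        # posterior, so grounding remains usable when answers are unavailable.
        kl = F.kl_div(F.log_softmax(prior_logits, -1),
                      F.softmax(post_logits, -1), reduction="batchmean")
        attn = F.softmax(post_logits, -1)  # attention over objects during training
        return attn, kl
```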

Visual-Textual Alignment for Graph Inference in Visual Dialog

no code yet • COLING 2020

As a conversational intelligence task, visual dialog entails answering a series of questions grounded in an image, using the dialog history as context.

Reasoning Over History: Context Aware Visual Dialog

no code yet • EMNLP (nlpbt) 2020

While neural models have been shown to exhibit strong performance on single-turn visual question answering (VQA) tasks, extending VQA to a multi-turn, conversational setting remains a challenge.

Multi-Modal Open-Domain Dialogue

no code yet • EMNLP 2021

Recent work in open-domain conversational agents has demonstrated that significant improvements in model engagingness and humanness metrics can be achieved via massive scaling in both pre-training data and model size (Adiwardana et al., 2020; Roller et al., 2020).