Visual Dialog
54 papers with code • 8 benchmarks • 10 datasets
Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the task is to answer the question.
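The task setup above can be sketched as a small data structure plus a prompt-building step. This is an illustrative sketch only; the class and function names are invented here and do not come from any particular Visual Dialog codebase:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QARound:
    """One completed question-answer turn in the dialog history."""
    question: str
    answer: str

@dataclass
class VisualDialogExample:
    """A single Visual Dialog instance: image, caption, history, and the
    follow-up question the agent must answer."""
    image_id: str
    caption: str
    history: List[QARound] = field(default_factory=list)
    question: str = ""

def build_prompt(ex: VisualDialogExample) -> str:
    # Flatten the caption, prior turns, and the current question into a
    # single text input, as a text-side model might consume it.
    turns = [ex.caption]
    for r in ex.history:
        turns.append(f"Q: {r.question} A: {r.answer}")
    turns.append(f"Q: {ex.question} A:")
    return "\n".join(turns)

example = VisualDialogExample(
    image_id="img_001",
    caption="Two dogs playing in a park.",
    history=[QARound("How many dogs are there?", "Two.")],
    question="What are they doing?",
)
print(build_prompt(example))
```

A real system would pair this text input with image features; the sketch only shows how the dialog history and follow-up question are threaded together.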
Libraries
Use these libraries to find Visual Dialog models and implementations.
Latest papers
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Collecting Visually-Grounded Dialogue with A Game Of Sorts
We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
It utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue and extensive non-dialogue multi-modal data.
Unified Multimodal Model with Unlikelihood Training for Visual Dialog
Prior work performs the standard likelihood training for answer generation on the positive instances (involving correct answers).
LAVIS: A Library for Language-Vision Intelligence
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.
Video Dialog as Conversation about Objects Living in Space-Time
To tackle these challenges we present a new object-centric framework for video dialog that supports neural reasoning dubbed COST - which stands for Conversation about Objects in Space-Time.
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
In this paper, we propose VD-PCR, a novel framework to improve Visual Dialog understanding with Pronoun Coreference Resolution in both implicit and explicit ways.
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
As a result, GST scales the training data to an order of magnitude more than VisDial (from 1.2M to 12.9M QA pairs).
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene
Visual dialog has witnessed great progress since various vision-oriented goals were introduced into the conversation, notably GuessWhich, where the image is visible to only one of the questioner and the answerer, and GuessWhat, where it is visible to both.
UNITER-Based Situated Coreference Resolution with Rich Multimodal Input
Our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3% after model ensembling.