Visual Question Answering (VQA)

763 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer vision task in which a system answers natural-language questions about an image. The goal is to teach machines to understand the content of an image well enough to answer questions about it in natural language.

Image Source: visualqa.org
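
For a concrete sense of the task, the sketch below queries an off-the-shelf VQA model through the Hugging Face transformers "visual-question-answering" pipeline; the checkpoint name and image path are illustrative assumptions, not recommendations tied to any paper listed here.

```python
# Minimal VQA inference sketch using the Hugging Face "visual-question-answering"
# pipeline. The checkpoint and image path below are illustrative assumptions.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# The pipeline takes an image (path, URL, or PIL.Image) and a natural-language
# question, and returns candidate answers ranked by confidence.
predictions = vqa(image="street_scene.jpg", question="How many people are crossing the street?")
print(predictions[0]["answer"], predictions[0]["score"])
```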

Latest papers with no code

Boter: Bootstrapping Knowledge Selection and Question Answering for Knowledge-based VQA

no code yet • 22 Apr 2024

Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content.
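
As a rough illustration of the knowledge-based VQA setup in general (a toy sketch, not the Boter method): the visual content is summarized, relevant external knowledge is selected, and both are handed to an answering model.

```python
# Toy illustration of knowledge-based VQA structure (not the Boter method):
# summarize visual content, select relevant external knowledge, compose a prompt.
from typing import List

KNOWLEDGE_BASE = [  # stand-in for a real external knowledge source
    "The Statue of Liberty was a gift from France to the United States.",
    "Golden retrievers were originally bred in Scotland as hunting dogs.",
    "The Eiffel Tower is located in Paris and was completed in 1889.",
]

def select_knowledge(question: str, visual_facts: str, k: int = 1) -> List[str]:
    """Rank knowledge snippets by naive word overlap with the question and image facts."""
    query_words = set((question + " " + visual_facts).lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda snippet: len(query_words & set(snippet.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, visual_facts: str) -> str:
    """Compose the answering prompt from visual facts plus selected knowledge."""
    knowledge = " ".join(select_knowledge(question, visual_facts))
    return (f"Image content: {visual_facts}\n"
            f"Relevant knowledge: {knowledge}\n"
            f"Question: {question}\nAnswer:")

print(build_prompt("Which country gave this statue as a gift?",
                   "a large green statue holding a torch on an island"))
```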

Adaptive Collaboration Strategy for LLMs in Medical Decision Making

no code yet • 22 Apr 2024

Our novel framework, Medical Decision-making Agents (MDAgents), aims to address this gap by automatically assigning an effective collaboration structure for LLMs.
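
Purely as an assumption about the general idea, not the MDAgents implementation, the sketch below shows how an adaptive collaboration strategy might route a case either to a single LLM or to a multi-agent panel based on a triage step.

```python
# Toy illustration (an assumption, not the MDAgents framework) of adaptively
# assigning a collaboration structure: a triage step estimates case complexity,
# then routes the query to a single model or to a simulated specialist panel.
from typing import List

def triage_complexity(case: str) -> str:
    """Placeholder triage; a real system would have an LLM rate the case."""
    return "complex" if len(case.split()) > 30 else "simple"

def solo_answer(case: str) -> str:
    return f"[single-model answer for: {case[:40]}...]"

def panel_answer(case: str, specialists: List[str]) -> str:
    opinions = [f"{name}: [opinion on the case]" for name in specialists]
    return "moderator synthesis of " + "; ".join(opinions)

def route(case: str) -> str:
    if triage_complexity(case) == "simple":
        return solo_answer(case)
    return panel_answer(case, ["cardiologist", "radiologist", "internist"])

print(route("65-year-old patient with acute chest pain radiating to the left arm"))
```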

Exploring Diverse Methods in Visual Question Answering

no code yet • 21 Apr 2024

This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms.
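
Of the three ingredients mentioned, attention is the easiest to show compactly; the sketch below is a generic question-guided attention layer over image region features (a common VQA pattern, assumed here for illustration, not the paper's specific architecture).

```python
# Generic question-guided attention over image region features for VQA
# (an illustrative pattern, not the architecture from the paper above).
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    def __init__(self, img_dim: int = 2048, q_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, img_dim); question: (batch, q_dim)
        joint = torch.tanh(self.img_proj(regions) + self.q_proj(question).unsqueeze(1))
        weights = torch.softmax(self.score(joint), dim=1)   # (batch, num_regions, 1)
        return (weights * regions).sum(dim=1)                # attended image feature

attn = QuestionGuidedAttention()
pooled = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(pooled.shape)  # torch.Size([2, 2048])
```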

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

no code yet • 19 Apr 2024

Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes.

Unified Scene Representation and Reconstruction for 3D Large Language Models

no code yet • 19 Apr 2024

Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

no code yet • 19 Apr 2024

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data.

PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

no code yet • 19 Apr 2024

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles.

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

no code yet • 18 Apr 2024

On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation.

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

no code yet • 18 Apr 2024

Moreover, we design a novel framework that fine-tunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process.
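
As a loose sketch of what incorporating rationales into training can look like in general (an assumption, not the MedThink implementation), the supervision target for a generative model can simply pair the answer with its rationale.

```python
# Schematic sketch (an assumption, not the MedThink implementation) of folding
# decision-making rationales into fine-tuning targets for a generative VQA model:
# the model is trained to emit the answer together with the rationale behind it.
def build_target(answer: str, rationale: str) -> str:
    """Supervision string asking the model to generate answer plus rationale."""
    return f"Answer: {answer}\nRationale: {rationale}"

sample = {  # hypothetical training example
    "question": "Does the chest X-ray show cardiomegaly?",
    "answer": "yes",
    "rationale": "The cardiac silhouette spans more than half of the thoracic width.",
}
print(build_target(sample["answer"], sample["rationale"]))
```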

Find The Gap: Knowledge Base Reasoning For Visual Question Answering

no code yet • 16 Apr 2024

How do task-specific and LLM-based models perform in integrating visual and external knowledge, and in multi-hop reasoning over both sources of information?