Visual Question Answering (VQA)

767 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer vision task in which a model is given an image and a natural-language question about it and must produce a correct answer. The goal of VQA is to teach machines to understand the content of an image well enough to answer questions about it in natural language.

Image Source: visualqa.org
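
To make the task concrete, here is a minimal inference sketch that answers a question about a single image with an off-the-shelf VQA model. It assumes the Hugging Face transformers and Pillow packages and the Salesforce/blip-vqa-base checkpoint; the image URL and question are illustrative placeholders, and any comparable VQA model could be swapped in.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load a publicly available VQA model (assumption: BLIP VQA base checkpoint).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Example image and question; replace with your own.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "How many cats are in the picture?"

# Encode the (image, question) pair and decode the generated answer.
inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```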

OmniFusion Technical Report

airi-institute/omnifusion 9 Apr 2024

We propose an OmniFusion model based on a pretrained LLM and adapters for the visual modality.

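The "pretrained LLM plus visual adapter" recipe referenced above boils down to a small trainable module that maps frozen vision-encoder features into the LLM's token-embedding space, so image patches can be consumed as extra tokens. The sketch below shows that generic pattern only; the class name VisualAdapter and the dimensions are hypothetical and do not reflect OmniFusion's actual architecture.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Toy adapter: projects frozen vision features into an LLM's embedding space.
    Illustrative only; hidden sizes are hypothetical, not OmniFusion's."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):      # (batch, num_patches, vision_dim)
        return self.proj(vision_feats)    # (batch, num_patches, llm_dim)

# Visual tokens produced this way are concatenated with text-token embeddings
# and fed to the (frozen or lightly tuned) LLM; only the adapter is trained here.
adapter = VisualAdapter()
fake_vision_feats = torch.randn(2, 256, 1024)  # e.g. ViT patch features
visual_tokens = adapter(fake_vision_feats)
print(visual_tokens.shape)                     # torch.Size([2, 256, 4096])
```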

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

faceonlive/ai-research 6 Apr 2024

In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception.

Evaluating Text-to-Visual Generation with Image-to-Text Generation

linzhiqiu/t2v_metrics 1 Apr 2024

For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.

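For context on the metric this paper critiques: CLIPScore is, at its core, a clamped and rescaled cosine similarity between CLIP's image and text embeddings. The sketch below computes that quantity with the openai/clip-vit-base-patch32 checkpoint from transformers; the scaling constant (and whether a reference-text term is included) varies across implementations, so treat this as an illustration rather than a reference implementation.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and prompt; replace with a generated image and its prompt.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "two cats lying on a pink couch"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between the projected image and text embeddings;
# CLIPScore-style metrics clamp at zero and rescale (the constant varies).
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
cos = (img_emb * txt_emb).sum(dim=-1)
clip_score = torch.clamp(cos, min=0) * 100
print(float(clip_score))
```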

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

atsumiyai/upd 29 Mar 2024

This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).

A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions

riken-grp/gazevqa 26 Mar 2024

Such ambiguities in questions are often clarified by context in conversational situations, such as joint attention with the user or user gaze information.

Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering

digitalphonetics/intrinsic-subgraph-generation-for-vqa 26 Mar 2024

In this work, we introduce an interpretable approach for graph-based VQA and demonstrate competitive performance on the GQA dataset.

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

csebuetnlp/illusionvqa 23 Mar 2024

GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought).

MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

biomedia-mbzuai/medpromptx 22 Mar 2024

Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR).

Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

bowen-upenn/Multi-Agent-VQA 21 Mar 2024

This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks.
