Visual Question Answering (VQA)

760 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer vision task in which a model is given an image and a natural-language question about that image and must produce an accurate answer. The goal of VQA is to teach machines to understand the content of an image well enough to answer such questions in natural language.

Image Source: visualqa.org
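
As a quick, concrete illustration of the task (a minimal sketch, not tied to any specific paper below), the following Python snippet assumes the Hugging Face transformers "visual-question-answering" pipeline with the dandelin/vilt-b32-finetuned-vqa checkpoint; the image path and question are placeholders.

# Minimal VQA inference sketch: ask a natural-language question about an image
# and print the top candidate answers with their confidence scores.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

answers = vqa(image="example.jpg", question="What color is the cat?", top_k=3)
for candidate in answers:
    print(candidate["answer"], round(candidate["score"], 3))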

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

lixinustc/kvq-challenge-cvpr-ntire2024 17 Apr 2024

This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), in which the submitted solutions were evaluated on KVQ, a dataset collected from the popular short-form video platform Kuaishou/Kwai.

MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models

jiangsongtao/tinymed 16 Apr 2024

Mixture-of-Experts Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored.

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

minhquan6203/vitextvqa-dataset 16 Apr 2024

Visual Question Answering (VQA) is a complex task that requires jointly processing natural language and images.

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

bzluan/textcot 15 Apr 2024

The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked.
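
The coarse-to-fine flow described above can be sketched roughly as follows; ask_mllm is a hypothetical helper standing in for any multimodal LLM call (it is not part of the bzluan/textcot code), and the prompts and box format are illustrative.

from PIL import Image

def ask_mllm(image, prompt):
    # Hypothetical stand-in for a multimodal LLM call (not the TextCoT API).
    raise NotImplementedError

def coarse_to_fine_answer(image_path, question):
    image = Image.open(image_path)

    # Stage 1: image overview -- summarize the global scene.
    overview = ask_mllm(image, "Briefly describe this image.")

    # Stage 2: coarse localization -- estimate the region likely to contain the answer.
    box_text = ask_mllm(
        image,
        f"Scene: {overview}\nQuestion: {question}\n"
        "Return the bounding box x1,y1,x2,y2 of the region containing the answer.",
    )
    x1, y1, x2, y2 = (int(v) for v in box_text.split(","))

    # Stage 3: zoom in -- crop the localized region and answer on the enlarged view.
    return ask_mllm(image.crop((x1, y1, x2, y2)), question)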

Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

faceonlive/ai-research 12 Apr 2024

This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
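
A caption-as-prompt pipeline of this kind can be sketched in a few lines; the model names are illustrative Hugging Face checkpoints, and a generic (rather than question-driven) captioner is used here for simplicity.

from transformers import pipeline

# Stage 1: caption the image (the paper studies question-driven captions;
# a generic captioner is used in this sketch).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Stage 2: answer the question from the caption with a text-only model.
reader = pipeline("text2text-generation", model="google/flan-t5-base")

def caption_then_answer(image_path, question):
    caption = captioner(image_path)[0]["generated_text"]
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return reader(prompt, max_new_tokens=20)[0]["generated_text"]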

OmniFusion Technical Report

airi-institute/omnifusion 9 Apr 2024

We propose the OmniFusion model, which is based on a pretrained LLM and adapters for the visual modality.
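
As background on the general LLM-plus-adapter pattern (a generic sketch with assumed dimensions, not OmniFusion's actual architecture), an adapter can simply project frozen vision-encoder features into the LLM's token-embedding space so they can be prepended to the text tokens.

import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    # Generic projection adapter: vision features -> LLM embedding space.
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        # Prepend projected visual tokens to the text sequence fed to the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=1)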

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

faceonlive/ai-research 6 Apr 2024

In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception.

Evaluating Text-to-Visual Generation with Image-to-Text Generation

linzhiqiu/t2i_metrics 1 Apr 2024

For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.
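
For reference, a CLIPScore-style alignment score is just a scaled cosine similarity between CLIP image and text embeddings; the sketch below follows the published 2.5 * max(cos, 0) formulation and is not the linzhiqiu/t2i_metrics implementation.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path, prompt):
    # Embed the image and the text prompt with CLIP, then score their alignment
    # as a clipped, scaled cosine similarity.
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)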

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

atsumiyai/upd 29 Mar 2024

This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).
