Video Question Answering

150 papers with code • 20 benchmarks • 31 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: presented with a video and a question, the model must produce an answer grounded in the video's content.
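
To make the input/output contract concrete, here is a minimal, self-contained sketch of an open-ended VideoQA model in PyTorch. It is not any published architecture; every module, dimension, and name below is an illustrative assumption (per-frame features fused with question embeddings, then classified over a fixed answer vocabulary).

```python
# A minimal sketch of the VideoQA interface, not any specific model.
# All module names, dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class ToyVideoQA(nn.Module):
    """Toy open-ended VideoQA head: fuse pooled video and question
    features, then classify over a fixed answer vocabulary."""

    def __init__(self, video_dim=512, text_dim=300, hidden=256, num_answers=1000):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, frame_feats, question_feats):
        # frame_feats: (batch, num_frames, video_dim) per-frame features
        # question_feats: (batch, num_tokens, text_dim) token embeddings
        v = self.video_proj(frame_feats).mean(dim=1)    # temporal mean-pool
        q = self.text_proj(question_feats).mean(dim=1)  # token mean-pool
        return self.classifier(v * q)                   # multiplicative fusion

model = ToyVideoQA()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 8, 300))
answer_ids = logits.argmax(dim=-1)  # indices into an answer vocabulary
```

Real systems replace the mean-pooling and multiplicative fusion with cross-modal attention, but the interface (video features plus a question in, an answer out) stays the same.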

Most implemented papers

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQAplus ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
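
As a concrete illustration of what spatio-temporal grounding adds to a QA annotation, the hypothetical record below pairs a multiple-choice question with a grounded time span and per-frame bounding boxes. The field names are assumptions chosen for readability, not TVQA+'s actual schema.

```python
# Illustrative shape of a spatio-temporally grounded QA annotation in the
# spirit of TVQA+; field names are assumptions, not the dataset's schema.
example = {
    "video_id": "clip_0001",
    "question": "What is the man holding when he enters the room?",
    "answers": ["a mug", "a phone", "a book", "keys", "a remote"],
    "answer_idx": 1,                      # correct choice
    "timestamps": [12.4, 17.9],           # grounded moment, in seconds
    "boxes": {                            # referenced concepts per frame
        "frame_00310": [{"label": "phone", "bbox": [220, 140, 290, 210]}],
    },
}
```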

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

SUTDCV/SUTD-TrafficQA CVPR 2021

In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA based on the collected 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

antoine77340/howto100m ECCV 2018

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g., a video clip and a language sentence).
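
A minimal sketch of the retrieval side of this setup, assuming pre-pooled embeddings: score one video against candidate sentences by cosine similarity and rank them. This is generic similarity-based ranking, not JSFusion's hierarchical joint-sequence matching, and all names are illustrative.

```python
# Generic video-sentence ranking by cosine similarity; a sketch, not
# JSFusion's matching mechanism. Embeddings here are random stand-ins.
import torch
import torch.nn.functional as F

def rank_candidates(video_emb: torch.Tensor, sentence_embs: torch.Tensor):
    """video_emb: (dim,) pooled video feature.
    sentence_embs: (num_candidates, dim) candidate sentence features.
    Returns candidate indices sorted from best to worst match."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), sentence_embs, dim=-1)
    return sims.argsort(descending=True)

order = rank_candidates(torch.randn(256), torch.randn(5, 256))
print(order)  # e.g. tensor([3, 0, 4, 1, 2])
```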

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet 17 Jul 2019

We also show that pre-training this network on a subset of modalities assists in learning unseen tasks such as video captioning and video question answering.

A Better Way to Attend: Attention with Trees for Video Question Answering

xuehy/TreeAttention 5 Sep 2019

We propose a new attention model for video question answering.

TutorialVQA: Question Answering Dataset for Tutorial Videos

acolas1/TutorialVQAData LREC 2020

The results indicate that the task is challenging and call for the investigation of new algorithms.

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

doc-doc/NExT-QA 18 May 2021

We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining temporal actions.

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

stanlei52/tqvsr 30 Nov 2021

In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).