Video Question Answering

150 papers with code • 20 benchmarks • 31 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about a given video. Given a video and a question, the model must produce an answer that is grounded in the content of the video.
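
To make the input/output contract concrete, here is a minimal sketch of a multiple-choice VideoQA item and an accuracy metric. The schema (`VideoQAExample`, its field names) and the baseline are hypothetical and chosen only for illustration; they do not correspond to any specific dataset or library listed on this page, and open-ended benchmarks typically score free-form answers differently.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VideoQAExample:
    """One multiple-choice VideoQA item (hypothetical schema, for illustration)."""
    video_path: str            # location of the video clip
    question: str              # natural-language question about the clip
    candidates: Sequence[str]  # candidate answers
    answer_idx: int            # index of the correct candidate


# A model is anything that maps an example to the index of its chosen answer.
Predictor = Callable[[VideoQAExample], int]


def accuracy(predict: Predictor, examples: Sequence[VideoQAExample]) -> float:
    """Fraction of examples for which the predicted index matches the label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(predict(ex) == ex.answer_idx for ex in examples)
    return correct / len(examples)


def first_candidate_baseline(example: VideoQAExample) -> int:
    """Trivial baseline that ignores the video and always picks candidate 0."""
    return 0
```

A real system would replace `first_candidate_baseline` with a function that actually decodes the video at `video_path` and scores each candidate against the question; the metric stays the same.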

Benchmarks

These leaderboards track progress in Video Question Answering.

Dataset	Best Model
ActivityNet-QA	GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)
NExT-QA	VLAP (3B)
MSRVTT-QA	Mirasol3B
STAR Benchmark	VLAP (4 frames)
MVBench	ST-LLM
AGQA 2.0 balanced	GF (sup) - Faster RCNN
iVQA	Text + Text (no Multimodal Pretext Training)
MSRVTT-MC	VIOLETv2
How2QA	Text + Text (no Multimodal Pretext Training)
TVQA	LLaMA-VQA
SUTD-TrafficQA	Tem-adapter
WildQA	Multi (text + video, IO)
LSMDC-MC	VIOLETv2
Howto100M-QA	Hero w/ pre-training
KnowIT VQA	(none listed)
LSMDC-FiB	Clover
MSR-VTT-MC	ATP (1<-16)
DramaQA	LLaMA-VQA
VLEP	LLaMA-VQA
VideoQA	Just Ask (fine-tune)

Libraries

Use these libraries to find Video Question Answering models and implementations.

salesforce/lavis • 2 papers • 8,691

computer-vision-in-the-wild/cvinw_r… • 2 papers • 994

jpthu17/diffusionret • 2 papers

pku-yuangroup/video-bench • 2 papers

Most implemented papers

TVQA+: Spatio-Temporal Grounding for Video Question Answering

jayleicn/TVQAplus • ACL 2020

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO • EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

SUTDCV/SUTD-TrafficQA • CVPR 2021

In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA based on 10,080 collected in-the-wild videos and 62,535 annotated QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios.

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter • 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

antoine77340/howto100m • ECCV 2018

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).

OmniNet: A unified architecture for multi-modal multi-task learning

subho406/OmniNet • 17 Jul 2019

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

A Better Way to Attend: Attention with Trees for Video Question Answering

xuehy/TreeAttention • 5 Sep 2019

We propose a new attention model for video question answering.

TutorialVQA: Question Answering Dataset for Tutorial Videos

acolas1/TutorialVQAData • LREC 2020

The results indicate that the task is challenging and call for the investigation of new algorithms.

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

doc-doc/NExT-QA • 18 May 2021

We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark to advance video understanding from describing to explaining the temporal actions.

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

stanlei52/tqvsr • 30 Nov 2021

In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).
