Video Question Answering
154 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about video content: given a video and a question, the model must produce an accurate answer grounded in what the video shows.
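As a rough illustration of the interface, here is a minimal inference sketch. The `model` and `processor` objects and their call signatures are hypothetical stand-ins for a generic video-language model, not the API of any particular paper listed below:

```python
import torch

def answer_video_question(model, processor, frames, question):
    """Toy VideoQA inference: encode sampled video frames together with
    the question, then decode a free-form answer. `model` and `processor`
    are hypothetical placeholders, not a real library API."""
    inputs = processor(videos=frames, text=question, return_tensors="pt")
    with torch.no_grad():
        answer_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
```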
Most implemented papers
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.
Revealing Single Frame Bias for Video-and-Language Learning
Training an effective video-and-language model intuitively requires multiple frames as model inputs.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
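The coordination of the two objectives can be pictured as a weighted sum of a masked-reconstruction loss and a video-text contrastive loss with a learnable mixing weight. The sketch below is an illustrative toy version under that assumption, not InternVideo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectiveLoss(nn.Module):
    """Illustrative sketch (not InternVideo's code): combine a masked
    video modeling reconstruction loss with a video-text contrastive
    loss, coordinated by a learnable mixing weight."""
    def __init__(self, temperature=0.07):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight
        self.temperature = temperature

    def forward(self, recon, target, video_emb, text_emb):
        # Masked video modeling: reconstruct the masked patches.
        mvm_loss = F.mse_loss(recon, target)
        # Video-text contrastive loss (symmetric InfoNCE over the batch).
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        nce_loss = (F.cross_entropy(logits, labels)
                    + F.cross_entropy(logits.t(), labels)) / 2
        # Coordinate the two complementary objectives in a learnable manner.
        w = torch.sigmoid(self.alpha)
        return w * mvm_loss + (1 - w) * nce_loss
```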
Visual Causal Scene Refinement for Video Question Answering
Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner.
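To make the QGR idea concrete, here is a toy sketch of question-guided refinement as cross-attention from frame features to question tokens; the single-layer design and dimensions are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class QuestionGuidedRefiner(nn.Module):
    """Toy sketch in the spirit of the QGR module described above: frame
    features attend to the question embedding so that segment features
    become question-aware. Layer layout and sizes are assumptions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, question_feats):
        # frame_feats: (batch, num_frames, dim)
        # question_feats: (batch, num_tokens, dim)
        refined, _ = self.cross_attn(frame_feats, question_feats, question_feats)
        return self.norm(frame_feats + refined)  # residual connection
```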
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models
Video question-answering is a fundamental task in the field of video understanding.
Generative Pretraining in Multimodality
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context.