Video Question Answering

154 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) is the task of answering natural-language questions about a given video: the model receives a video and a question and must produce an answer grounded in the video's content.
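
At its core the task maps a (video, question) pair to an answer. The sketch below shows this interface in Python; the `model` object and its `encode_video`/`decode_answer` methods are hypothetical placeholders, not any particular library's API.

```python
# Minimal sketch of the VideoQA task interface. The model methods used
# here are hypothetical placeholders, not a specific library's API.
from dataclasses import dataclass

@dataclass
class VideoQASample:
    frames: list        # sampled video frames, e.g. decoded RGB arrays
    question: str       # natural-language question about the video
    answer: str = ""    # ground-truth answer, empty at inference time

def answer_question(model, sample: VideoQASample) -> str:
    """Condition the answer on both the video content and the question."""
    video_features = model.encode_video(sample.frames)   # hypothetical API
    return model.decode_answer(video_features, sample.question)
```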

Latest papers with no code

VideoPrism: A Foundational Visual Encoder for Video Understanding

no code yet • 20 Feb 2024

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
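
The frozen-backbone setup described here can be illustrated as one shared encoder with gradients disabled plus a small task-specific head; only the head is trained per task. The class below is an illustrative sketch, not the actual VideoPrism model or API.

```python
# Sketch of reusing a single frozen video encoder across tasks.
# FrozenBackboneHead is an illustrative stand-in, not VideoPrism itself.
import torch
import torch.nn as nn

class FrozenBackboneHead(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the shared backbone
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)  # only this is trained

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # no gradients through encoder
            feats = self.encoder(video)       # (batch, feat_dim)
        return self.head(feats)
```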

Slot-VLM: SlowFast Slots for Video-Language Modeling

no code yet • 20 Feb 2024

A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens to align with LLMs.
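
To make the token-budget problem concrete, the simplest baseline compresses many per-frame tokens into K representative tokens by windowed mean pooling; learned schemes (slots, query attention) replace this pooling. The function below is only an illustration of the problem, not the Slot-VLM design.

```python
# Naive baseline for compressing video tokens to a fixed LLM-sized budget.
# Assumes num_frames >= k; not the Slot-VLM method, just an illustration.
import torch

def pool_to_k_tokens(frame_tokens: torch.Tensor, k: int) -> torch.Tensor:
    """frame_tokens: (num_frames, dim) -> (k, dim) via windowed mean pooling."""
    chunks = frame_tokens.chunk(k, dim=0)   # k roughly equal temporal windows
    return torch.stack([c.mean(dim=0) for c in chunks])
```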

Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering

no code yet • 16 Feb 2024

We present Q-ViD, a simple approach to video question answering (video QA). Unlike prior methods that depend on complex architectures, computationally expensive pipelines, or closed models such as GPTs, Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) and tackles video QA using frame descriptions.
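
The captions-then-answer pipeline can be sketched as below. Here `caption_frame` and `generate` are hypothetical stand-ins for the instruction-tuned VLM calls, and the prompt strings are invented for illustration; the paper's exact prompting will differ.

```python
# Sketch of a captions-then-answer video QA pipeline: describe each sampled
# frame with a VLM, then answer from the concatenated descriptions.
# caption_frame and generate are hypothetical callables, not a real API.
def video_qa_via_captions(frames, question, caption_frame, generate) -> str:
    # Ask the VLM for a question-aware description of each frame.
    descriptions = [
        caption_frame(f, prompt=f"Describe details relevant to: {question}")
        for f in frames
    ]
    context = "\n".join(f"Frame {i}: {d}" for i, d in enumerate(descriptions))
    prompt = (
        f"Video frame descriptions:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```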

BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

no code yet • 12 Feb 2024

This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

no code yet • 19 Jan 2024

Gaussian Contrastive Grounding (GCG) learns multiple Gaussian functions to characterize the temporal structure of the video and samples question-critical frames as positive moments to serve as the visual inputs of LMMs.
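
A minimal version of Gaussian temporal weighting looks like the following; in practice each Gaussian's mean and width would be predicted from the question, so the fixed parameterization here is an assumption for illustration only.

```python
# Illustrative sketch of weighting frames with Gaussians over time and
# keeping the top-weighted frames as "question-critical" positives.
# The fixed means/widths are assumptions, not the paper's learned model.
import torch

def gaussian_frame_weights(num_frames: int, means: torch.Tensor,
                           widths: torch.Tensor) -> torch.Tensor:
    """means, widths: (num_gaussians,) in [0, 1]; returns (num_frames,) weights."""
    t = torch.linspace(0, 1, num_frames)            # normalized frame timestamps
    # Sum of Gaussian bumps; frames near any mean receive high weight.
    bumps = torch.exp(-0.5 * ((t[None, :] - means[:, None]) / widths[:, None]) ** 2)
    weights = bumps.sum(dim=0)
    return weights / weights.sum()                  # normalize to a distribution

weights = gaussian_frame_weights(32, torch.tensor([0.2, 0.7]),
                                 torch.tensor([0.05, 0.10]))
top_frames = weights.topk(8).indices                # indices of positive moments
```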

Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering

no code yet • 3 Jan 2024

While significant advancements have been made in video question answering (VideoQA), the potential benefits of enhancing model generalization through tailored difficulty scheduling have been largely overlooked in existing research.
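
As a rough sketch of what uncertainty-aware difficulty scheduling can look like: rank samples by the entropy of the model's answer distribution (low entropy means "sure", hence easy) and grow the training pool from easy to uncertain over epochs. The schedule below is illustrative, not the paper's.

```python
# Sketch of an entropy-based curriculum: train on the most certain samples
# first and linearly expand the pool. Illustrative schedule, not the paper's.
import torch

def entropy(probs: torch.Tensor) -> torch.Tensor:
    # probs: (num_samples, num_answers); higher entropy = more uncertain
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def curriculum_order(answer_probs: torch.Tensor) -> torch.Tensor:
    """Indices sorted easiest (most certain) first."""
    return torch.argsort(entropy(answer_probs))

def pool_for_epoch(order: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    # Linearly grow the training pool from the easiest 20% to the full set.
    frac = min(1.0, 0.2 + 0.8 * epoch / max(1, total_epochs - 1))
    return order[: max(1, int(frac * len(order)))]
```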

Perception Test 2023: A Summary of the First Challenge And Outcome

no code yet • 20 Dec 2023

The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark.

Cross-Modal Reasoning with Event Correlation for Video Question Answering

no code yet • 20 Dec 2023

Video Question Answering (VideoQA) is an attractive and challenging research direction that aims to understand the complex semantics of heterogeneous data from two domains, i.e., the spatio-temporal video content and the word sequence of the question.

Text-Conditioned Resampler For Long Form Video Understanding

no code yet • 19 Dec 2023

In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task.
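
One plausible shape for such a module, sketched below under assumptions: learned queries, conditioned on a pooled text embedding, cross-attend over the long sequence of frozen visual features and emit a short sequence for the LLM. The names and wiring here are illustrative, not the TCR implementation.

```python
# Illustrative sketch of a text-conditioned resampler between a frozen
# visual encoder and an LLM. Not the paper's TCR module; names are assumed.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, text_emb: torch.Tensor):
        # visual_feats: (batch, long_T, dim) from a frozen encoder
        # text_emb:     (batch, dim) pooled embedding of the task text
        # Broadcast (1, Q, dim) + (batch, 1, dim) -> (batch, Q, dim)
        q = self.queries.unsqueeze(0) + self.text_proj(text_emb).unsqueeze(1)
        resampled, _ = self.attn(q, visual_feats, visual_feats)
        return resampled  # (batch, num_queries, dim), fed to the LLM
```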

ViLA: Efficient Video-Language Alignment for Video Question Answering

no code yet • 13 Dec 2023

In this work, we propose an efficient Video-Language Alignment (ViLA) network.