Video Question Answering
151 papers with code • 20 benchmarks • 31 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: given a video and a question, the model must produce an accurate answer grounded in the video's content.
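To make the task interface concrete, here is a minimal toy sketch of VideoQA input and output. Real models consume raw pixels; in this illustration each frame is represented as a set of detected object labels, and the "model" is a keyword-matching baseline that answers yes/no presence questions. All names here are assumptions for illustration, not any benchmark's API.

```python
def answer_question(frame_labels, question):
    """Toy VideoQA baseline: answer 'yes' if any word of the question
    names an object that appears in any frame, else 'no'."""
    words = set(question.lower().strip("?").split())
    seen = set().union(*frame_labels) if frame_labels else set()
    return "yes" if words & seen else "no"

# A "video" of three frames, each a set of detected object labels.
video = [{"dog", "ball"}, {"dog"}, {"person", "dog"}]
print(answer_question(video, "Is there a ball in the video?"))  # yes
```

Such a baseline obviously ignores temporal order and relationships; the papers below are largely about modeling exactly what this sketch leaves out.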
Libraries
Use these libraries to find Video Question Answering models and implementations.

Latest papers with no code
Pegasus-v1 Technical Report
This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language.
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT-4-0613 in human evaluation.
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.
Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
Compositional spatio-temporal reasoning poses a significant challenge in the field of video question answering (VideoQA).
Koala: Key frame-conditioned long video-LLM
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question.
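The Traverse/Locate/Evaluate loop described above can be sketched as a simple agent skeleton. The frame-description function, stopping rule, and traversal stride below are stand-in assumptions for illustration, not the paper's actual LMM prompts or planner.

```python
def traveler_loop(frames, question, describe_frame, enough, max_steps=10):
    """Skeleton of a Traverse-Locate-Evaluate loop over video frames."""
    memory = []                       # "Locate": store key per-frame information
    idx, step = 0, 0
    while step < max_steps and idx < len(frames):
        memory.append(describe_frame(frames[idx], question))  # query one frame
        if enough(memory, question):  # "Evaluate": enough info to answer?
            break
        idx += max(1, len(frames) // max_steps)  # "Traverse": pick next frame
        step += 1
    return memory

# Usage with trivial stubs in place of LMM calls:
frames = ["frame_a", "frame_b", "frame_c", "frame_d"]
collected = traveler_loop(
    frames, "what happens?",
    describe_frame=lambda f, q: f,
    enough=lambda mem, q: "frame_c" in mem,
    max_steps=4,
)
```

The point of the structure is that the model never ingests the whole video at once: it accumulates only the information it judged relevant to the question.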
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
We will release our dataset, codes, and models to help future efforts in this domain.
VideoDistill: Language-aware Vision Distillation for Video Question Answering
In this paper, we draw inspiration from human recognition and learning patterns and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question.
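The open-ended setting described here amounts to ranking a large answer vocabulary against a video-question pair. A hedged sketch of that formulation: the embeddings below are toy vectors (an assumption for illustration), where a real model would score answers with learned video-question and answer representations.

```python
import numpy as np

def rank_answers(query_vec, answer_vecs, answer_texts, k=3):
    """Rank a (large) answer set by dot-product relevance to a
    video-question embedding and return the top-k answer strings."""
    scores = answer_vecs @ query_vec      # one relevance score per candidate
    top = np.argsort(-scores)[:k]         # indices of the k highest scores
    return [answer_texts[i] for i in top]

# Toy 2-d embeddings: the query is closest to "dog".
query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
texts = ["dog", "cat", "dog and cat"]
print(rank_answers(query, candidates, texts, k=2))  # ['dog', 'dog and cat']
```

Framing the answer set as a ranking target is also what makes distillation natural here: a teacher's ranking over candidates carries more signal than a single hard label.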
VideoPrism: A Foundational Visual Encoder for Video Understanding
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
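The "single frozen encoder, many tasks" pattern that VideoPrism exemplifies can be sketched as one shared encoder whose weights never update, with a small trainable head per downstream task. The encoder stand-in, feature dimensions, and class counts below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM = 16

def frozen_encoder(video_features):
    # Stand-in for a frozen video encoder: mean-pool per-frame
    # features into a single clip embedding. Never trained further.
    return np.asarray(video_features).mean(axis=0)

class LinearHead:
    """Per-task head: the only trainable part in this pattern."""
    def __init__(self, n_classes):
        self.W = rng.normal(size=(ENC_DIM, n_classes))
    def __call__(self, emb):
        return int(np.argmax(emb @ self.W))  # predicted class index

# One clip of 8 frames with precomputed ENC_DIM-d features.
video = rng.normal(size=(8, ENC_DIM))
emb = frozen_encoder(video)                 # shared representation
heads = {"action": LinearHead(10), "vqa": LinearHead(50)}
preds = {task: head(emb) for task, head in heads.items()}
```

The appeal of this design is amortization: one forward pass through the expensive encoder serves every task head attached to it.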