Video Question Answering

154 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about a given video. Given a video and a question, the model must produce an answer that is grounded in the content of the video.
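
As a rough illustration of the task's input and output, the sketch below models a VideoQA item as a (video, question, answer) triple and scores a predictor by exact-match accuracy. The names `VideoQAExample`, `exact_match_accuracy`, and `constant_baseline`, the file names, and the sample questions are all hypothetical, and exact match is only one possible metric; individual benchmarks define their own evaluation protocols.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoQAExample:
    """One VideoQA item: a video, a question about it, and a reference answer."""
    video_path: str  # hypothetical path to a video file
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    predict: Callable[[str, str], str],
    examples: List[VideoQAExample],
) -> float:
    """Fraction of examples whose predicted answer equals the reference answer."""
    if not examples:
        return 0.0
    correct = sum(
        normalize(predict(ex.video_path, ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return correct / len(examples)


def constant_baseline(video_path: str, question: str) -> str:
    """Placeholder predictor that ignores the video; a real model would use it."""
    return "yes"


# Made-up examples, for illustration only.
examples = [
    VideoQAExample("clip_001.mp4", "Is the person cooking?", "Yes"),
    VideoQAExample("clip_002.mp4", "What is the dog chasing?", "a ball"),
]
accuracy = exact_match_accuracy(constant_baseline, examples)
print(accuracy)
```

Any callable that maps a video path and a question to an answer string can be passed as `predict`, so the same loop could compare different models under this simplified metric.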

Benchmarks

These leaderboards are used to track progress in Video Question Answering.

Dataset	Best Model
ActivityNet-QA	GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)
NExT-QA	VLAP (3B)
MSRVTT-QA	Mirasol3B
STAR Benchmark	VLAP (4 frames)
MVBench	PLLaVA
AGQA 2.0 balanced	GF (sup) - Faster RCNN
iVQA	Text + Text (no Multimodal Pretext Training)
MSRVTT-MC	VIOLETv2
How2QA	Text + Text (no Multimodal Pretext Training)
TVQA	LLaMA-VQA
SUTD-TrafficQA	Tem-adapter
WildQA	Multi (text + video, IO)
LSMDC-MC	VIOLETv2
Howto100M-QA	Hero w/ pre-training
KnowIT VQA	
LSMDC-FiB	Clover
MSR-VTT-MC	ATP (1<-16)
DramaQA	LLaMA-VQA
VLEP	LLaMA-VQA
VideoQA	Just Ask (fine-tune)

Libraries

Use these libraries to find Video Question Answering models and implementations.

salesforce/lavis

2 papers

8,774

computer-vision-in-the-wild/cvinw_r…

2 papers

1,012

jpthu17/diffusionret

2 papers

pku-yuangroup/video-bench

2 papers

Latest papers

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

yellow-binary-tree/hawkeye • 15 Mar 2024

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.

DAM: Dynamic Adapter Merging for Continual Video QA Learning

klauscc/dam • 13 Mar 2024

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.

LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

bigai-nlco/lstp-chat • 25 Feb 2024

Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

alpha-vllm/llama2-accessory • 8 Feb 2024

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.

2,514

CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

Yui010206/CREMA • 8 Feb 2024

Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities.

YTCommentQA: Video Question Answerability in Instructional Videos

lgresearch/ytcommentqa • 30 Jan 2024

Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning.

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

yellow-binary-tree/STAIR • 8 Jan 2024

However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.

Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

hoplee6/sports-qa • 3 Jan 2024

Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval.

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

byz0e/glance-focus • NeurIPS 2023

Instead of that, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage.

03 Jan 2024

A Simple LLM Framework for Long-Range Video Question-Answering

ceezh/llovi • 28 Dec 2023

Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
