Video Question Answering

151 papers with code • 20 benchmarks • 31 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about videos: given a video and a question in natural language, a model must produce an accurate answer grounded in the video's content.
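
Concretely, the task's input/output contract is a function from (video frames, question) to an answer. Below is a minimal, runnable sketch of that interface in Python; the data layout and the trivial word-overlap "model" are illustrative assumptions, not any published method.

```python
# Minimal sketch of the VideoQA interface: (video frames, question) -> answer.
# The scoring logic is a toy placeholder, not a real model.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class VideoQAExample:
    frames: np.ndarray            # (T, H, W, 3) uint8 RGB frames sampled from the video
    question: str                 # natural language question about the video
    answer: Optional[str] = None  # ground-truth answer, if labeled

def answer_question(example: VideoQAExample, candidates: List[str]) -> str:
    """Placeholder scorer: a real model would fuse visual features of the
    frames with an encoding of the question and rank the candidates."""
    q_tokens = set(example.question.lower().split())
    overlap = [len(q_tokens & set(a.lower().split())) for a in candidates]
    return candidates[int(np.argmax(overlap))]

frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)  # 16 uniformly sampled frames
ex = VideoQAExample(frames=frames, question="What is the man holding?")
print(answer_question(ex, ["a guitar", "a phone", "nothing"]))
```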

Latest papers with no code

Pegasus-v1 Technical Report

no code yet • 23 Apr 2024

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language.

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

no code yet • 18 Apr 2024

On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation.

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

no code yet • 9 Apr 2024

This paper addresses the task of video question answering (VideoQA) via a decomposed multi-stage, modular reasoning framework.
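
As a rough illustration of what "decomposed, multi-stage, modular" can mean in code, here is a hypothetical pipeline where each stage reads from and writes to a shared memory; the stage names and toy logic are assumptions for illustration, not MoReVQA's actual modules or prompts.

```python
# Hypothetical multi-stage, modular videoQA pipeline with shared memory.
from typing import Any, Callable, Dict

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(question: str, video_id: str, stages: list[Stage]) -> str:
    memory: Dict[str, Any] = {"question": question, "video": video_id}
    for stage in stages:              # each stage reads and updates shared memory
        memory.update(stage(memory))
    return memory.get("answer", "unanswerable")

# Toy stages standing in for, e.g., parsing, grounding, and reasoning modules.
parse  = lambda m: {"events": m["question"].lower().split()}
ground = lambda m: {"frames": [0, 8, 16]}   # frames deemed relevant to the events
reason = lambda m: {"answer": f"answer derived from frames {m['frames']}"}

print(run_pipeline("What happens after the dog barks?", "vid_001", [parse, ground, reason]))
```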

Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

no code yet • 5 Apr 2024

Compositional spatio-temporal reasoning poses a significant challenge in the field of video question answering (VideoQA).

Koala: Key frame-conditioned long video-LLM

no code yet • 5 Apr 2024

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.

TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

no code yet • 1 Apr 2024

We propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" whether there is enough information to answer the question.
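
The snippet describes an iterative agent loop; a hypothetical sketch of such a Traverse/Locate/Evaluate loop follows. The function names, frame-sampling plan, and stopping rule are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical Traverse/Locate/Evaluate loop in the spirit of TraveLER.
from typing import Callable, Dict, List, Optional

def traveler_style_loop(
    num_frames: int,
    describe_frame: Callable[[int, str], str],                 # e.g. an LMM queried per frame
    evaluate: Callable[[str, Dict[int, str]], Optional[str]],  # answer, or None if unsure
    question: str,
    max_steps: int = 8,
) -> str:
    memory: Dict[int, str] = {}                       # "Locate": store key information
    plan: List[int] = list(range(0, num_frames, max(1, num_frames // max_steps)))
    for t in plan:                                    # "Traverse" the video by plan
        memory[t] = describe_frame(t, question)       # ask about an individual frame
        answer = evaluate(question, memory)           # "Evaluate": enough info yet?
        if answer is not None:
            return answer
    return evaluate(question, memory) or "unanswerable"

# Stub usage with dummy callables in place of real multimodal models.
print(traveler_style_loop(
    num_frames=32,
    describe_frame=lambda t, q: f"frame {t}: a man holds a guitar",
    evaluate=lambda q, mem: "a guitar" if len(mem) >= 2 else None,
    question="What is the man holding?",
))
```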

CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes

no code yet • 1 Apr 2024

We will release our dataset, code, and models to help future efforts in this domain.

VideoDistill: Language-aware Vision Distillation for Video Question Answering

no code yet • 1 Apr 2024

In this paper, inspired by human recognition and learning patterns, we propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.

Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

no code yet • 21 Mar 2024

This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question.
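
For context, open-ended VideoQA is commonly cast as classification over a large global answer vocabulary. The sketch below shows that standard formulation; the dimensions and fusion module are illustrative assumptions, and this is not the paper's ranking-distillation method.

```python
# Standard open-ended VideoQA formulation: score every answer in a large
# vocabulary and pick the argmax. Shapes are illustrative.
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, vocab_size=4000):
        super().__init__()
        self.fuse = nn.Linear(video_dim + text_dim, 512)  # joint video-question feature
        self.head = nn.Linear(512, vocab_size)            # one logit per candidate answer

    def forward(self, video_feat, question_feat):
        joint = torch.relu(self.fuse(torch.cat([video_feat, question_feat], dim=-1)))
        return self.head(joint)                           # (batch, vocab_size) logits

model = AnswerClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 512))
pred = logits.argmax(dim=-1)  # index into the global answer set
```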

VideoPrism: A Foundational Visual Encoder for Video Understanding

no code yet • 20 Feb 2024

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
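
The "single frozen model" idea reflects the common pattern of a fixed video backbone with small trainable task heads. A minimal PyTorch sketch of that pattern follows; the stand-in backbone, dimensions, and class count are assumptions, not the actual VideoPrism architecture or weights.

```python
# Frozen-encoder pattern: keep the video backbone fixed, train only a task head.
import torch
import torch.nn as nn

backbone = nn.Sequential(                 # stand-in video encoder, not VideoPrism
    nn.Flatten(start_dim=1),              # (B, T*C*H*W) flattened clip
    nn.Linear(8 * 3 * 32 * 32, 768),      # -> (B, 768) clip-level embedding
)
for p in backbone.parameters():
    p.requires_grad_(False)               # freeze: one model serves all tasks

head = nn.Linear(768, 174)                # e.g. a classification head for one task
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # train the head only

clips = torch.randn(4, 8, 3, 32, 32)      # (batch, frames, channels, H, W)
with torch.no_grad():
    feats = backbone(clips)               # frozen features, no gradients
loss = nn.functional.cross_entropy(head(feats), torch.randint(0, 174, (4,)))
loss.backward()
optimizer.step()
```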