Video Question Answering
151 papers with code • 20 benchmarks • 31 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: given a video and a question, the model must produce an accurate answer grounded in the video's content.
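To make the task interface concrete, here is a minimal toy sketch of VideoQA input and output. Real models consume raw pixels; in this illustration each frame is represented as a set of detected object labels, and the "model" is a keyword-matching baseline that answers yes/no presence questions. All names here are assumptions for illustration, not any benchmark's API.

```python
def answer_question(frame_labels, question):
    """Toy VideoQA baseline: answer 'yes' if any word of the question
    names an object that appears in any frame, else 'no'."""
    words = set(question.lower().strip("?").split())
    seen = set().union(*frame_labels) if frame_labels else set()
    return "yes" if words & seen else "no"

# A "video" of three frames, each a set of detected object labels.
video = [{"dog", "ball"}, {"dog"}, {"person", "dog"}]
print(answer_question(video, "Is there a ball in the video?"))  # yes
```

Such a baseline obviously ignores temporal order and relationships; the papers below are largely about modeling exactly what this sketch leaves out.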
Libraries
Use these libraries to find Video Question Answering models and implementations.

Latest papers with no code
Pegasus-v1 Technical Report
This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language.
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT-4-0613 in human evaluation.
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.
Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
Compositional spatio-temporal reasoning poses a significant challenge in the field of video question answering (VideoQA).
Koala: Key frame-conditioned long video-LLM
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question.
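The Traverse/Locate/Evaluate loop described above can be sketched as a simple agent skeleton. The frame-description function, stopping rule, and traversal stride below are stand-in assumptions for illustration, not the paper's actual LMM prompts or planner.

```python
def traveler_loop(frames, question, describe_frame, enough, max_steps=10):
    """Skeleton of a Traverse-Locate-Evaluate loop over video frames."""
    memory = []                       # "Locate": store key per-frame information
    idx, step = 0, 0
    while step < max_steps and idx < len(frames):
        memory.append(describe_frame(frames[idx], question))  # query one frame
        if enough(memory, question):  # "Evaluate": enough info to answer?
            break
        idx += max(1, len(frames) // max_steps)  # "Traverse": pick next frame
        step += 1
    return memory

# Usage with trivial stubs in place of LMM calls:
frames = ["frame_a", "frame_b", "frame_c", "frame_d"]
collected = traveler_loop(
    frames, "what happens?",
    describe_frame=lambda f, q: f,
    enough=lambda mem, q: "frame_c" in mem,
    max_steps=4,
)
```

The point of the structure is that the model never ingests the whole video at once: it accumulates only the information it judged relevant to the question.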
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
We will release our dataset, codes, and models to help future efforts in this domain.
VideoDistill: Language-aware Vision Distillation for Video Question Answering
In this paper, we draw inspiration from human recognition and learning patterns and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question.
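The open-ended setting described here amounts to ranking a large answer vocabulary against a video-question pair. A hedged sketch of that formulation: the embeddings below are toy vectors (an assumption for illustration), where a real model would score answers with learned video-question and answer representations.

```python
import numpy as np

def rank_answers(query_vec, answer_vecs, answer_texts, k=3):
    """Rank a (large) answer set by dot-product relevance to a
    video-question embedding and return the top-k answer strings."""
    scores = answer_vecs @ query_vec      # one relevance score per candidate
    top = np.argsort(-scores)[:k]         # indices of the k highest scores
    return [answer_texts[i] for i in top]

# Toy 2-d embeddings: the query is closest to "dog".
query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
texts = ["dog", "cat", "dog and cat"]
print(rank_answers(query, candidates, texts, k=2))  # ['dog', 'dog and cat']
```

Framing the answer set as a ranking target is also what makes distillation natural here: a teacher's ranking over candidates carries more signal than a single hard label.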
VideoPrism: A Foundational Visual Encoder for Video Understanding
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
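The "single frozen encoder, many tasks" pattern that VideoPrism exemplifies can be sketched as one shared encoder whose weights never update, with a small trainable head per downstream task. The encoder stand-in, feature dimensions, and class counts below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM = 16

def frozen_encoder(video_features):
    # Stand-in for a frozen video encoder: mean-pool per-frame
    # features into a single clip embedding. Never trained further.
    return np.asarray(video_features).mean(axis=0)

class LinearHead:
    """Per-task head: the only trainable part in this pattern."""
    def __init__(self, n_classes):
        self.W = rng.normal(size=(ENC_DIM, n_classes))
    def __call__(self, emb):
        return int(np.argmax(emb @ self.W))  # predicted class index

# One clip of 8 frames with precomputed ENC_DIM-d features.
video = rng.normal(size=(8, ENC_DIM))
emb = frozen_encoder(video)                 # shared representation
heads = {"action": LinearHead(10), "vqa": LinearHead(50)}
preds = {task: head(emb) for task, head in heads.items()}
```

The appeal of this design is amortization: one forward pass through the expensive encoder serves every task head attached to it.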