Video Question Answering
154 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) aims to answer natural language questions about a given video: given a video and a question in natural language, the model must produce an accurate answer based on the content of the video.
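One way to make "accurate answers" concrete is exact-match accuracy over question-answer pairs. The sketch below is illustrative only: the `VideoQASample` type, its field names, and the normalization rule are assumptions made for this example, not taken from any particular benchmark, and individual benchmarks may score answers differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VideoQASample:
    """Hypothetical container for one VideoQA example."""
    video_path: str  # assumed: path to the video file
    question: str    # natural language question about the video
    answer: str      # ground-truth answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    samples: Sequence[VideoQASample], predictions: Sequence[str]
) -> float:
    """Return the fraction of predictions that match the ground truth after normalization."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(
        normalize(sample.answer) == normalize(prediction)
        for sample, prediction in zip(samples, predictions)
    )
    return correct / len(samples)
```

Here the model itself is left out on purpose: any system that maps a video and a question to an answer string can be scored this way.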
Libraries
Use these libraries to find Video Question Answering models and implementations
Latest papers
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding
Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.
CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities.
YTCommentQA: Video Question Answerability in Instructional Videos
Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning.
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports
Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval.
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Instead of that, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage.
A Simple LLM Framework for Long-Range Video Question-Answering
Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
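A minimal sketch of the summarize-then-answer idea described above, assuming the short-term captions are already available as a list of strings. The function name and the prompt wording are hypothetical and are not the prompt used in the paper; the code only assembles the prompt text and does not call any LLM.

```python
from typing import Sequence


def build_summarize_then_answer_prompt(captions: Sequence[str], question: str) -> str:
    """Assemble a prompt that asks for a summary of the captions before the answer."""
    # Number the captions so their temporal order is visible to the LLM.
    caption_block = "\n".join(
        f"{index}. {caption.strip()}"
        for index, caption in enumerate(captions, start=1)
    )
    return (
        "The captions below describe consecutive short clips of one video. "
        "They may be noisy.\n\n"
        f"Captions:\n{caption_block}\n\n"
        "Step 1: Summarize what happens in the video.\n"
        "Step 2: Using your summary, answer the question.\n\n"
        f"Question: {question.strip()}\n"
    )
```

For example, `build_summarize_then_answer_prompt(["a person opens a fridge", "a person pours milk into a bowl"], "What is the person preparing?")` returns a single string that can be passed to whichever LLM is in use.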