Video Question Answering

154 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) aims to answer natural language questions about a given video: the model receives a video and a question, and produces an answer grounded in the video's content.


HawkEye: Training Video-Text LLMs for Grounding Text in Videos

yellow-binary-tree/hawkeye 15 Mar 2024

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.


DAM: Dynamic Adapter Merging for Continual Video QA Learning

klauscc/dam 13 Mar 2024

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
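The core idea of merging per-dataset adapters can be sketched as a router-weighted average of adapter parameters. The function below is a hypothetical simplification for illustration (names like `merge_adapters` and the plain-dict adapter format are assumptions, not DAM's actual API):

```python
import numpy as np

def merge_adapters(adapters, router_scores):
    """Merge per-dataset adapter parameter dicts into a single adapter,
    weighted by (normalized) router scores. A toy sketch of dynamic
    adapter merging; real systems merge low-rank adapter weights inside
    a transformer."""
    weights = np.asarray(router_scores, dtype=float)
    weights = weights / weights.sum()  # normalize router scores to sum to 1
    merged = {}
    for name in adapters[0]:
        # elementwise weighted sum of the same parameter across adapters
        merged[name] = sum(w * a[name] for w, a in zip(weights, adapters))
    return merged

# toy example: two adapters, each holding one 2x2 weight matrix
a1 = {"proj": np.ones((2, 2))}
a2 = {"proj": 3 * np.ones((2, 2))}
merged = merge_adapters([a1, a2], router_scores=[0.5, 0.5])
# merged["proj"] is the elementwise average: every entry equals 2.0
```

With equal router scores this reduces to plain parameter averaging; skewed scores let the router favor the adapter trained on the domain closest to the current input.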


LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

bigai-nlco/lstp-chat 25 Feb 2024

Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.


SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

alpha-vllm/llama2-accessory 8 Feb 2024

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.


CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

Yui010206/CREMA 8 Feb 2024

Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities.


YTCommentQA: Video Question Answerability in Instructional Videos

lgresearch/ytcommentqa 30 Jan 2024

Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning.


STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

yellow-binary-tree/STAIR 8 Jan 2024

However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.


Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

hoplee6/sports-qa 3 Jan 2024

Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval.


Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

byz0e/glance-focus NeurIPS 2023

Instead of that, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage.


A Simple LLM Framework for Long-Range Video Question-Answering

ceezh/llovi 28 Dec 2023

Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
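The two-stage prompting strategy described here (summarize the noisy captions first, then answer) can be sketched as plain prompt construction. The exact wording below is hypothetical, not the paper's actual prompt:

```python
def build_prompts(captions, question):
    """Build the two prompts for a summarize-then-answer pipeline over
    short-term video captions. Prompt phrasing is illustrative only."""
    caption_text = "\n".join(captions)
    # Stage 1: ask the LLM to condense the noisy per-clip captions
    summary_prompt = (
        "The following are noisy short-term captions of a long video:\n"
        f"{caption_text}\n"
        "Summarize what happens in the video based on these captions."
    )
    # Stage 2: answer the question from the summary, not the raw captions
    def answer_prompt(summary):
        return (
            f"Video summary:\n{summary}\n"
            f"Question: {question}\n"
            "Answer the question based only on the summary."
        )
    return summary_prompt, answer_prompt

sp, ap = build_prompts(
    ["a man opens a door", "he walks inside", "he sits at a desk"],
    "What does the man do after opening the door?",
)
# sp is sent to the LLM first; its output is fed to ap(...) for the answer.
```

Feeding the summary rather than the raw caption list into the answering step is what the authors report as the source of the LVQA performance boost.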
