Video Question Answering

154 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) is the task of answering natural-language questions about a given video: the model receives a video and a question and must produce an answer grounded in the video's content.
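
At its core the task maps a (video, question) pair to an answer. The sketch below shows this interface in Python; the `model` object and its `encode_video`/`decode_answer` methods are hypothetical placeholders, not any particular library's API.

```python
# Minimal sketch of the VideoQA task interface. The model methods used
# here are hypothetical placeholders, not a specific library's API.
from dataclasses import dataclass

@dataclass
class VideoQASample:
    frames: list        # sampled video frames, e.g. decoded RGB arrays
    question: str       # natural-language question about the video
    answer: str = ""    # ground-truth answer, empty at inference time

def answer_question(model, sample: VideoQASample) -> str:
    """Condition the answer on both the video content and the question."""
    video_features = model.encode_video(sample.frames)   # hypothetical API
    return model.decode_answer(video_features, sample.question)
```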

Latest papers with no code

VideoPrism: A Foundational Visual Encoder for Video Understanding

no code yet • 20 Feb 2024

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
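
The frozen-backbone setup described here can be illustrated as one shared encoder with gradients disabled plus a small task-specific head; only the head is trained per task. The class below is an illustrative sketch, not the actual VideoPrism model or API.

```python
# Sketch of reusing a single frozen video encoder across tasks.
# FrozenBackboneHead is an illustrative stand-in, not VideoPrism itself.
import torch
import torch.nn as nn

class FrozenBackboneHead(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the shared backbone
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)  # only this is trained

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # no gradients through encoder
            feats = self.encoder(video)       # (batch, feat_dim)
        return self.head(feats)
```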

Slot-VLM: SlowFast Slots for Video-Language Modeling

no code yet • 20 Feb 2024

A pivotal challenge is the development of an efficient method to encapsulate video content into a set of representative tokens to align with LLMs.
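
To make the token-budget problem concrete, the simplest baseline compresses many per-frame tokens into K representative tokens by windowed mean pooling; learned schemes (slots, query attention) replace this pooling. The function below is only an illustration of the problem, not the Slot-VLM design.

```python
# Naive baseline for compressing video tokens to a fixed LLM-sized budget.
# Assumes num_frames >= k; not the Slot-VLM method, just an illustration.
import torch

def pool_to_k_tokens(frame_tokens: torch.Tensor, k: int) -> torch.Tensor:
    """frame_tokens: (num_frames, dim) -> (k, dim) via windowed mean pooling."""
    chunks = frame_tokens.chunk(k, dim=0)   # k roughly equal temporal windows
    return torch.stack([c.mean(dim=0) for c in chunks])
```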

Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering

no code yet • 16 Feb 2024

We present Q-ViD, a simple approach to video question answering (video QA). Unlike prior methods that depend on complex architectures, computationally expensive pipelines, or closed models such as GPTs, Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) and tackles video QA using frame descriptions.
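
The captions-then-answer pipeline can be sketched as below. Here `caption_frame` and `generate` are hypothetical stand-ins for the instruction-tuned VLM calls, and the prompt strings are invented for illustration; the paper's exact prompting will differ.

```python
# Sketch of a captions-then-answer video QA pipeline: describe each sampled
# frame with a VLM, then answer from the concatenated descriptions.
# caption_frame and generate are hypothetical callables, not a real API.
def video_qa_via_captions(frames, question, caption_frame, generate) -> str:
    # Ask the VLM for a question-aware description of each frame.
    descriptions = [
        caption_frame(f, prompt=f"Describe details relevant to: {question}")
        for f in frames
    ]
    context = "\n".join(f"Frame {i}: {d}" for i, d in enumerate(descriptions))
    prompt = (
        f"Video frame descriptions:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```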

BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

no code yet • 12 Feb 2024

This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

no code yet • 19 Jan 2024

Gaussian Contrastive Grounding (GCG) learns multiple Gaussian functions to characterize the temporal structure of the video and samples question-critical frames as positive moments to serve as the visual inputs of LMMs.
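
A minimal version of Gaussian temporal weighting looks like the following; in practice each Gaussian's mean and width would be predicted from the question, so the fixed parameterization here is an assumption for illustration only.

```python
# Illustrative sketch of weighting frames with Gaussians over time and
# keeping the top-weighted frames as "question-critical" positives.
# The fixed means/widths are assumptions, not the paper's learned model.
import torch

def gaussian_frame_weights(num_frames: int, means: torch.Tensor,
                           widths: torch.Tensor) -> torch.Tensor:
    """means, widths: (num_gaussians,) in [0, 1]; returns (num_frames,) weights."""
    t = torch.linspace(0, 1, num_frames)            # normalized frame timestamps
    # Sum of Gaussian bumps; frames near any mean receive high weight.
    bumps = torch.exp(-0.5 * ((t[None, :] - means[:, None]) / widths[:, None]) ** 2)
    weights = bumps.sum(dim=0)
    return weights / weights.sum()                  # normalize to a distribution

weights = gaussian_frame_weights(32, torch.tensor([0.2, 0.7]),
                                 torch.tensor([0.05, 0.10]))
top_frames = weights.topk(8).indices                # indices of positive moments
```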

Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering

no code yet • 3 Jan 2024

While significant advancements have been made in video question answering (VideoQA), the potential benefits of enhancing model generalization through tailored difficulty scheduling have been largely overlooked in existing research.
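
As a rough sketch of what uncertainty-aware difficulty scheduling can look like: rank samples by the entropy of the model's answer distribution (low entropy means "sure", hence easy) and grow the training pool from easy to uncertain over epochs. The schedule below is illustrative, not the paper's.

```python
# Sketch of an entropy-based curriculum: train on the most certain samples
# first and linearly expand the pool. Illustrative schedule, not the paper's.
import torch

def entropy(probs: torch.Tensor) -> torch.Tensor:
    # probs: (num_samples, num_answers); higher entropy = more uncertain
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def curriculum_order(answer_probs: torch.Tensor) -> torch.Tensor:
    """Indices sorted easiest (most certain) first."""
    return torch.argsort(entropy(answer_probs))

def pool_for_epoch(order: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    # Linearly grow the training pool from the easiest 20% to the full set.
    frac = min(1.0, 0.2 + 0.8 * epoch / max(1, total_epochs - 1))
    return order[: max(1, int(frac * len(order)))]
```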

Perception Test 2023: A Summary of the First Challenge And Outcome

no code yet • 20 Dec 2023

The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark.

Cross-Modal Reasoning with Event Correlation for Video Question Answering

no code yet • 20 Dec 2023

Video Question Answering (VideoQA) is an attractive and challenging research direction that aims to understand the complex semantics of heterogeneous data from two domains, i.e., the spatio-temporal video content and the word sequence of the question.

Text-Conditioned Resampler For Long Form Video Understanding

no code yet • 19 Dec 2023

In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task.
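
One plausible shape for such a module, sketched below under assumptions: learned queries, conditioned on a pooled text embedding, cross-attend over the long sequence of frozen visual features and emit a short sequence for the LLM. The names and wiring here are illustrative, not the TCR implementation.

```python
# Illustrative sketch of a text-conditioned resampler between a frozen
# visual encoder and an LLM. Not the paper's TCR module; names are assumed.
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, text_emb: torch.Tensor):
        # visual_feats: (batch, long_T, dim) from a frozen encoder
        # text_emb:     (batch, dim) pooled embedding of the task text
        # Broadcast (1, Q, dim) + (batch, 1, dim) -> (batch, Q, dim)
        q = self.queries.unsqueeze(0) + self.text_proj(text_emb).unsqueeze(1)
        resampled, _ = self.attn(q, visual_feats, visual_feats)
        return resampled  # (batch, num_queries, dim), fed to the LLM
```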

ViLA: Efficient Video-Language Alignment for Video Question Answering

no code yet • 13 Dec 2023

In this work, we propose an efficient Video-Language Alignment (ViLA) network.