Video Question Answering
154 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about video content: given a video and a question, the model must produce an accurate answer grounded in what the video shows.
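As a rough illustration of the interface, here is a minimal inference sketch. The `model` and `processor` objects and their call signatures are hypothetical stand-ins for a generic video-language model, not the API of any particular paper listed below:

```python
import torch

def answer_video_question(model, processor, frames, question):
    """Toy VideoQA inference: encode sampled video frames together with
    the question, then decode a free-form answer. `model` and `processor`
    are hypothetical placeholders, not a real library API."""
    inputs = processor(videos=frames, text=question, return_tensors="pt")
    with torch.no_grad():
        answer_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
```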
Most implemented papers
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.
Revealing Single Frame Bias for Video-and-Language Learning
Training an effective video-and-language model intuitively requires multiple frames as model inputs.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
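The coordination of the two objectives can be pictured as a weighted sum of a masked-reconstruction loss and a video-text contrastive loss with a learnable mixing weight. The sketch below is an illustrative toy version under that assumption, not InternVideo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectiveLoss(nn.Module):
    """Illustrative sketch (not InternVideo's code): combine a masked
    video modeling reconstruction loss with a video-text contrastive
    loss, coordinated by a learnable mixing weight."""
    def __init__(self, temperature=0.07):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight
        self.temperature = temperature

    def forward(self, recon, target, video_emb, text_emb):
        # Masked video modeling: reconstruct the masked patches.
        mvm_loss = F.mse_loss(recon, target)
        # Video-text contrastive loss (symmetric InfoNCE over the batch).
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        nce_loss = (F.cross_entropy(logits, labels)
                    + F.cross_entropy(logits.t(), labels)) / 2
        # Coordinate the two complementary objectives in a learnable manner.
        w = torch.sigmoid(self.alpha)
        return w * mvm_loss + (1 - w) * nce_loss
```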
Visual Causal Scene Refinement for Video Question Answering
Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner.
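To make the QGR idea concrete, here is a toy sketch of question-guided refinement as cross-attention from frame features to question tokens; the single-layer design and dimensions are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class QuestionGuidedRefiner(nn.Module):
    """Toy sketch in the spirit of the QGR module described above: frame
    features attend to the question embedding so that segment features
    become question-aware. Layer layout and sizes are assumptions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, question_feats):
        # frame_feats: (batch, num_frames, dim)
        # question_feats: (batch, num_tokens, dim)
        refined, _ = self.cross_attn(frame_feats, question_feats, question_feats)
        return self.norm(frame_feats + refined)  # residual connection
```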
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models
Video question-answering is a fundamental task in the field of video understanding.
Generative Pretraining in Multimodality
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context.