Video Question Answering

154 papers with code • 20 benchmarks • 32 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: the model takes a video and a question as input and must produce an accurate answer grounded in the video's content.
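
To make the input/output contract concrete, here is a minimal runnable sketch. All names (`sample_frames`, `VideoQAModel`) are hypothetical stand-ins for illustration, not any particular library's API.

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample num_frames frames from a (T, H, W, C) video clip."""
    indices = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[indices]

class VideoQAModel:
    """Hypothetical stand-in for a trained (frames, question) -> answer model."""
    def __init__(self, answer_vocab):
        self.answer_vocab = answer_vocab

    def predict(self, frames: np.ndarray, question: str) -> str:
        # A real model would jointly encode the frames and the question and
        # score candidate answers; random scores keep this sketch runnable.
        scores = np.random.rand(len(self.answer_vocab))
        return self.answer_vocab[int(scores.argmax())]

video = np.random.randint(0, 255, size=(64, 224, 224, 3), dtype=np.uint8)
model = VideoQAModel(answer_vocab=["yes", "no", "a dog", "two people"])
print(model.predict(sample_frames(video), "How many people appear?"))
```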

Most implemented papers

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

showlab/demovlp 15 Mar 2022

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Revealing Single Frame Bias for Video-and-Language Learning

jayleicn/singularity 7 Jun 2022

Training an effective video-and-language model intuitively requires multiple frames as model inputs.
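
The paper questions that intuition by training with a single frame. Below is a rough sketch of the sampling idea, with shapes and helper names as illustrative assumptions rather than the repository's code: one random frame stands in for the clip during training, while several uniformly spaced frames can still be aggregated at inference.

```python
import torch

def sample_train_frame(video: torch.Tensor) -> torch.Tensor:
    """Training: one randomly sampled frame from a (T, C, H, W) clip."""
    t = torch.randint(0, video.shape[0], (1,)).item()
    return video[t]

def sample_eval_frames(video: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Inference: k uniformly spaced frames for aggregation."""
    idx = torch.linspace(0, video.shape[0] - 1, k).long()
    return video[idx]

video = torch.randn(32, 3, 224, 224)      # a 32-frame clip
print(sample_train_frame(video).shape)    # torch.Size([3, 224, 224])
print(sample_eval_frames(video).shape)    # torch.Size([4, 3, 224, 224])
```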

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

antoyang/FrozenBiLM 16 Jun 2022

Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.

X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

zengyan-97/x2-vlm 22 Nov 2022

Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
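
A common way to learn such alignments is a symmetric contrastive (InfoNCE) loss over paired vision/text embeddings, sketched below. This is the generic image-level formulation for illustration only; X²-VLM itself goes beyond it with multi-grained alignment.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired vision/text embeddings."""
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(len(v))           # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```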

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

opengvlab/internvideo 6 Dec 2022

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
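
As a rough illustration of coordinating two complementary representation streams in a learnable manner, the sketch below blends a generative (masked video modeling) feature and a discriminative (contrastive) feature with a learned weight. The gating scheme is an assumption made for this sketch; see the repository for InternVideo's actual design.

```python
import torch
import torch.nn as nn

class TwoStreamCoordinator(nn.Module):
    """Learnably blends generative and discriminative video features."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # learnable mixing logit

    def forward(self, gen_feat: torch.Tensor, dis_feat: torch.Tensor):
        w = torch.sigmoid(self.alpha)                 # weight in (0, 1)
        return w * gen_feat + (1.0 - w) * dis_feat

coord = TwoStreamCoordinator()
fused = coord(torch.randn(4, 768), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 768])
```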

Visual Causal Scene Refinement for Video Question Answering

yangliu9208/vcsr 7 May 2023

Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner.
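
To make the data flow between the two modules concrete, here is a schematic skeleton: QGR refines frame features via question-guided attention, and CSS scores the refined segments to split causal from non-causal scenes. All shapes, layers, and the thresholding rule are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QuestionGuidedRefiner(nn.Module):
    """QGR sketch: refine frame features via cross-attention to the question."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats, question_feats):
        refined, _ = self.attn(frame_feats, question_feats, question_feats)
        return frame_feats + refined  # residual question-conditioned refinement

class CausalSceneSeparator(nn.Module):
    """CSS sketch: score segments and split them into causal / non-causal."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, segment_feats):
        scores = self.scorer(segment_feats).squeeze(-1)          # (B, S)
        causal = scores > scores.median(dim=-1, keepdim=True).values
        return causal, scores

frames = torch.randn(2, 16, 512)     # (batch, segments, feature dim)
question = torch.randn(2, 12, 512)   # (batch, question tokens, feature dim)
segments = QuestionGuidedRefiner()(frames, question)
mask, scores = CausalSceneSeparator()(segments)
print(segments.shape, mask.shape)    # (2, 16, 512) (2, 16)
```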

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

salesforce/lavis NeurIPS 2023

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

PaLI-X: On Scaling up a Multilingual Vision and Language Model

kyegomez/PALI 29 May 2023

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.

Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models

declare-lab/sealing 9 Jul 2023

Video question-answering is a fundamental task in the field of video understanding.

Generative Pretraining in Multimodality

baaivision/emu 11 Jul 2023

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context.