Video Question Answering

13 papers with code · Computer Vision

Benchmarks

Greatest papers with code

OmniNet: A unified architecture for multi-modal multi-task learning

17 Jul 2019 subho406/OmniNet

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

IMAGE CAPTIONING MULTI-TASK LEARNING PART-OF-SPEECH TAGGING QUESTION ANSWERING VIDEO CAPTIONING VIDEO QUESTION ANSWERING VISUAL QUESTION ANSWERING

TVQA: Localized, Compositional Video Question Answering

EMNLP 2018 jayleicn/TVQA

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.

VIDEO QUESTION ANSWERING

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

ECCV 2018 antoine77340/howto100m

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).

QUESTION ANSWERING SEMANTIC SIMILARITY SEMANTIC TEXTUAL SIMILARITY VIDEO QUESTION ANSWERING VIDEO RETRIEVAL VISUAL QUESTION ANSWERING
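The core pattern behind such video-sentence matching is to score pairwise affinities between the two sequences and pool them into one similarity. A minimal numpy sketch of that skeleton (JSFusion itself learns a hierarchical fusion network; the cosine affinities and max-mean pooling here are illustrative assumptions, not the paper's method):

```python
import numpy as np

def sequence_similarity(video_seq, text_seq):
    """Toy video-sentence similarity: build a frame-word affinity
    matrix from cosine similarities, then pool it to one score.
    JSFusion replaces this fixed pooling with learned joint fusion."""
    # L2-normalise each frame / word vector so dot products are cosines
    v = video_seq / np.linalg.norm(video_seq, axis=1, keepdims=True)
    t = text_seq / np.linalg.norm(text_seq, axis=1, keepdims=True)
    affinity = v @ t.T                  # (num_frames, num_words)
    # best-matching word per frame, averaged over frames
    return affinity.max(axis=1).mean()

rng = np.random.default_rng(2)
clip = rng.normal(size=(4, 8))      # 4 frame features, 8-dim
caption = rng.normal(size=(3, 8))   # 3 word embeddings, 8-dim
score = sequence_similarity(clip, caption)
```

A sequence compared against itself scores exactly 1.0 under this pooling, which makes the sketch easy to sanity-check.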

TVQA+: Spatio-Temporal Grounding for Video Question Answering

ACL 2020 jayleicn/TVQA-PLUS

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.

QUESTION ANSWERING VIDEO QUESTION ANSWERING

Hierarchical Conditional Relation Networks for Video Question Answering

CVPR 2020 thaolmk54/hcrn-videoqa

Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts.

QUESTION ANSWERING VIDEO QUESTION ANSWERING VISUAL QUESTION ANSWERING

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

CVPR 2019 fanchenyou/HME-VideoQA

In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features; 2) a redesigned question memory which helps understand the complex semantics of question and highlights queried subjects; and 3) a new multimodal fusion layer which performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention.

QUESTION ANSWERING VIDEO QUESTION ANSWERING VISUAL QUESTION ANSWERING
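The third component above, multi-step reasoning with self-updated attention, boils down to a loop: attend over visual and textual features with a query, then refresh the query from the attended summaries. A heavily simplified numpy sketch of that loop (the function name, dimensions, and tanh update are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_step_fusion(visual, textual, steps=3):
    """Toy multi-step reasoning: at each step, attend over visual and
    textual features, then update the query from the attended summaries
    (the 'self-updated attention' idea, greatly simplified)."""
    d = visual.shape[-1]
    query = np.zeros(d)                      # step 0: uniform attention
    for _ in range(steps):
        v_att = softmax(visual @ query)      # (num_frames,)
        t_att = softmax(textual @ query)     # (num_words,)
        v_sum = v_att @ visual               # attended visual hint, (d,)
        t_sum = t_att @ textual              # attended textual hint, (d,)
        query = np.tanh(v_sum + t_sum)       # self-update for next step
    return query

rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))   # 8 frames of appearance/motion features
textual = rng.normal(size=(5, 16))  # 5 question-word embeddings
fused = multi_step_fusion(visual, textual)
print(fused.shape)  # (16,)
```

In the actual model each step would use learned projections and separate memories for appearance, motion, and question semantics; the sketch only shows why iterating the attend-then-update cycle lets later steps focus on features the question actually queries.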

Visual Relation Grounding in Videos

ECCV 2020 doc-doc/vRGV

In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV).

QUESTION ANSWERING VIDEO QUESTION ANSWERING

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

6 Jun 2019 MILVLG/activitynet-qa
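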

It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA).

QUESTION ANSWERING VIDEO QUESTION ANSWERING VISUAL QUESTION ANSWERING

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

ACL 2020 hyounghk/VideoQADenseCapFrameGate-ACL2020

Moreover, our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier.

IMAGE CAPTIONING MULTI-LABEL CLASSIFICATION QUESTION ANSWERING TEMPORAL LOCALIZATION VIDEO QUESTION ANSWERING
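The frame-selection gating mentioned above can be pictured as scoring each frame against the question and letting only high-scoring frames pass to the classifier. A toy numpy sketch under that reading (the dot-product scoring and sigmoid gate are assumptions for illustration; the paper learns its gates jointly with dense-caption matching):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_selection_gate(frame_feats, question_feat):
    """Toy frame-selection gate: score every frame against the question
    and re-weight frame features so irrelevant frames are suppressed
    before they reach a downstream answer classifier."""
    scores = frame_feats @ question_feat        # (num_frames,) relevance
    gates = sigmoid(scores)                     # each gate in (0, 1)
    gated = frame_feats * gates[:, None]        # soft temporal selection
    return gated, gates

rng = np.random.default_rng(1)
frames = rng.normal(size=(6, 8))    # 6 frame features, 8-dim
question = rng.normal(size=8)       # pooled question feature, 8-dim
gated, gates = frame_selection_gate(frames, question)
```

Because the gates are soft, the selection stays differentiable, which is what lets temporal localization be trained end-to-end with the QA objective.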