We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video clip and a language sentence).
Ranked #2 on Video Retrieval on MSR-VTT
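As a hedged illustration of the JSFusion idea above, here is a minimal sketch that scores a (video, sentence) pair by fusing per-frame and per-word features into a joint affinity tensor and pooling it to a single similarity. This is not the paper's exact architecture; every module name and dimension is an assumption for exposition.

```python
# Minimal sketch of pairwise video-sentence similarity (illustrative only,
# not the JSFusion paper's exact model). Dimensions are assumptions.
import torch
import torch.nn as nn

class PairSimilarity(nn.Module):
    def __init__(self, video_dim=2048, word_dim=300, hidden=512):
        super().__init__()
        self.vid_proj = nn.Linear(video_dim, hidden)  # per-frame projection
        self.txt_proj = nn.Linear(word_dim, hidden)   # per-word projection
        self.fuse = nn.Sequential(                    # scores each (frame, word) cell
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, frames, words):
        # frames: (T, video_dim) frame features; words: (N, word_dim) embeddings
        v = self.vid_proj(frames)                  # (T, hidden)
        w = self.txt_proj(words)                   # (N, hidden)
        joint = v.unsqueeze(1) * w.unsqueeze(0)    # (T, N, hidden) fusion tensor
        affinity = self.fuse(joint).squeeze(-1)    # (T, N) frame-word affinities
        return affinity.mean()                     # pool to one similarity score

sim = PairSimilarity()
score = sim(torch.randn(16, 2048), torch.randn(12, 300))  # one (video, sentence) pair
```

For retrieval, such a score would be computed for every candidate pair and ranked; the plain mean used here is the simplest possible aggregation, standing in for whatever learned pooling the paper uses.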
We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos.
Ranked #2 on Video Question Answering on TVQA
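To make the task definition above concrete, the sketch below shows what one instance of spatio-temporal VideoQA might look like: a question over a video, candidate answers, the relevant temporal moment, and boxes grounding the referenced people and objects. The field names and values are hypothetical, not the dataset's actual schema.

```python
# Hypothetical structure of one spatio-temporal VideoQA instance; a system
# must pick the answer, localize the moment, and ground referenced concepts.
from dataclasses import dataclass, field

@dataclass
class SpatioTemporalQA:
    video_id: str
    question: str
    answers: list                 # candidate answer strings
    correct_idx: int              # index of the correct answer
    moment: tuple                 # (start_sec, end_sec) of the relevant span
    grounded_boxes: dict = field(default_factory=dict)  # frame_idx -> [(x, y, w, h, label)]

example = SpatioTemporalQA(
    video_id="clip_0042",
    question="What does the woman pick up after she opens the fridge?",
    answers=["A bottle", "A plate", "A phone", "A towel", "A book"],
    correct_idx=0,
    moment=(12.4, 17.9),
    grounded_boxes={310: [(96, 140, 48, 80, "bottle")]},
)
```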
Video question answering (VideoQA) is challenging because it requires the capacity to model dynamic visual artifacts and distant relations and to associate them with linguistic concepts.
Ranked #1 on Visual Question Answering on MSVD-QA
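One standard way to associate linguistic concepts with distant visual content, as the abstract above calls for, is cross-attention: each question word attends over all frame features, so frames far apart in time can still be linked to the words that query them. A minimal sketch, with illustrative shapes:

```python
# Minimal sketch: question words (queries) attend over all frame features
# (keys/values). Shapes and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
frames = torch.randn(1, 64, 512)  # 64 frame features for one video
words = torch.randn(1, 10, 512)   # 10 question-word features

fused, weights = attn(query=words, key=frames, value=frames)
# fused:   (1, 10, 512) word representations grounded in relevant frames
# weights: (1, 10, 64)  which frames each word attended to
```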
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory that effectively learns global context information from appearance and motion features; 2) a redesigned question memory that helps understand the complex semantics of the question and highlights the queried subjects; and 3) a new multimodal fusion layer that performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention.
Ranked #3 on Visual Question Answering on MSVD-QA
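A hedged sketch of the multi-step reasoning idea described above: a controller vector repeatedly attends over a visual memory and a question memory, then updates itself with what it read. This is not the paper's implementation; the GRU-based update and all names and dimensions are assumptions.

```python
# Illustrative multi-step attention fusion over two memories (not the
# paper's exact heterogeneous-memory design).
import torch
import torch.nn as nn

class MultiStepFusion(nn.Module):
    def __init__(self, dim=512, steps=3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(2 * dim, dim)  # folds read vectors into the controller

    def attend(self, ctrl, memory):
        # soft attention of the controller over one memory of shape (L, dim)
        scores = memory @ ctrl                  # (L,) relevance of each slot
        return (scores.softmax(dim=0).unsqueeze(1) * memory).sum(dim=0)

    def forward(self, visual_mem, question_mem):
        ctrl = question_mem.mean(dim=0)         # initialize controller from the question
        for _ in range(self.steps):             # multi-step reasoning loop
            v = self.attend(ctrl, visual_mem)   # read relevant visual hints
            q = self.attend(ctrl, question_mem) # read relevant textual hints
            ctrl = self.update(torch.cat([v, q]).unsqueeze(0),
                               ctrl.unsqueeze(0)).squeeze(0)  # self-update
        return ctrl                             # fused representation for an answer head

fusion = MultiStepFusion()
out = fusion(torch.randn(40, 512), torch.randn(12, 512))
```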
We present HERO, a novel framework for large-scale video+language omni-representation learning.
Ranked #1 on Video Retrieval on TVR
Moreover, our model also comprises dual-level attention (word/object and frame levels), multi-head self/cross integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
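A minimal sketch of the gating idea in the abstract above: a learned sigmoid gate trades off the two source representations before the classifier, so the more relevant evidence passes through. The module names, shapes, and the specific convex-combination form are assumptions, not the paper's exact design.

```python
# Illustrative gated fusion of two sources feeding an answer classifier.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=512, num_answers=5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, video_feat, caption_feat):
        # video_feat, caption_feat: (batch, dim) pooled source representations
        g = self.gate(torch.cat([video_feat, caption_feat], dim=-1))  # (batch, dim)
        fused = g * video_feat + (1 - g) * caption_feat  # gate weighs the two sources
        return self.classifier(fused)                    # answer logits

model = GatedFusion()
logits = model(torch.randn(2, 512), torch.randn(2, 512))
```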