Motion-Appearance Co-Memory Networks for Video Question Answering

Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that there are three unique attributes of video QA compared with image QA: (1) it deals with long sequences of images containing richer information not only in quantity but also in variety; (2) motion and appearance information are usually correlated with each other and can provide useful attention cues to each other; (3) different questions require different numbers of frames to infer the answer...

Venue: CVPR 2018
TASK                       DATASET     MODEL    METRIC NAME   METRIC VALUE   GLOBAL RANK
Visual Question Answering  MSRVTT-QA   Co-Mem   Accuracy      0.32           # 4
Visual Question Answering  MSVD-QA     Co-Mem   Accuracy      0.317          # 4

Methods used in the Paper

METHOD                  TYPE
Softmax                 Output Functions
GRU                     Recurrent Neural Networks
Dynamic Memory Network  Working Memory Models
Memory Network          Working Memory Models