no code implementations • EMNLP 2018 • Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, Jiebo Luo
Most existing approaches adopt the pipeline of representing an image via pre-trained CNNs, and then using the uninterpretable CNN features in conjunction with the question to predict the answer.
no code implementations • CVPR 2017 • Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui
To solve the challenges, we propose a multi-level attention network for visual question answering that can simultaneously reduce the semantic gap by semantic attention and benefit fine-grained spatial inference by visual attention.
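The idea of combining semantic attention and visual attention can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain dot-product soft attention, and the function names (`softmax`, `attend`), the toy vectors, and the concatenation used to fuse the two contexts are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, items):
    # Generic dot-product soft attention (an assumption, not the paper's exact form):
    # score each item against the query, normalize, and return the weighted sum.
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return weights, pooled

# Toy inputs (hypothetical): a question embedding, three image-region features,
# and two semantic-concept embeddings, all in the same 2-d space.
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
concepts = [[1.0, 1.0], [3.0, 0.0]]

# Visual attention over spatial regions, semantic attention over concepts.
visual_weights, visual_context = attend(question, regions)
semantic_weights, semantic_context = attend(question, concepts)

# Fuse the two attended contexts by concatenation (one simple option among many).
joint = visual_context + semantic_context
```

In a full model, a vector like `joint` would be passed to a classifier over candidate answers; here it only shows how two attention levels driven by the same question can be combined.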