( Image credit: Deep Visual-Semantic Alignments for Generating Image Descriptions )
Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years.
#31 best model for Machine Translation on WMT2014 English-French
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.
We observe that our method consistently outperforms beam search (BS) and previously proposed techniques for diverse decoding from neural sequence models.
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
#10 best model for Visual Question Answering on VQA v2 test-std
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.
In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.