Deep Visual-Semantic Alignments for Generating Image Descriptions

CVPR 2015 · Andrej Karpathy, Li Fei-Fei

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data...
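The correspondences mentioned in the abstract are scored in a shared image-sentence embedding space. As a rough illustrative sketch only (not the authors' released code), the snippet below computes the paper's simplified alignment score S_kl = sum_t max_i v_i · s_t: each word vector is matched to its best-supporting region vector, and the per-word maxima are summed. The names `region_embs` and `word_embs` are placeholders for projected CNN region features and bidirectional-RNN word representations.

```python
import numpy as np

def image_sentence_score(region_embs: np.ndarray, word_embs: np.ndarray) -> float:
    """Score an (image, sentence) pair in a shared embedding space.

    region_embs: (num_regions, d) region vectors, e.g. R-CNN features projected to d dims
    word_embs:   (num_words, d) word vectors, e.g. bidirectional-RNN states in the same space

    Each word is aligned to the region with the highest inner product, and the
    per-word maxima are summed (the simplified score S_kl = sum_t max_i v_i . s_t).
    """
    sims = word_embs @ region_embs.T        # (num_words, num_regions) inner products
    return float(sims.max(axis=1).sum())    # best region per word, summed over words

# toy usage with random vectors standing in for real features
rng = np.random.default_rng(0)
print(image_sentence_score(rng.normal(size=(19, 128)), rng.normal(size=(7, 128))))
```

Scoring by per-word maxima lets each word latch onto a different region, which is what makes the resulting region-level alignments interpretable.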


Results from the Paper


TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK
Text-Image Retrieval | COCO | DVSA | R@10 | 80.5 | # 1
Cross-Modal Retrieval | COCO 2014 | DVSA (R-CNN, AlexNet) | Image-to-text R@1 | 38.4 | # 4
Cross-Modal Retrieval | COCO 2014 | DVSA (R-CNN, AlexNet) | Image-to-text R@5 | 69.9 | # 4
Cross-Modal Retrieval | COCO 2014 | DVSA (R-CNN, AlexNet) | Image-to-text R@10 | 80.5 | # 4
Text-Image Retrieval | COCO (image as query) | DVSA | R@10 | 74.8 | # 3
Question Generation | COCO Visual Question Answering (VQA) real images 1.0 open ended | coco-Caption (Karpathy and Li, 2014) | BLEU-1 | 62.5 | # 2
Image Retrieval | Flickr30K 1K test | DVSA (R-CNN, AlexNet) | R@1 | 15.2 | # 10
Image Retrieval | Flickr30K 1K test | DVSA (R-CNN, AlexNet) | R@10 | 50.5 | # 9
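The retrieval numbers above are Recall@K: the fraction of queries whose ground-truth match appears among the top K retrieved candidates. A minimal sketch of that metric, assuming a precomputed query-by-candidate similarity matrix and one correct candidate per query:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """Recall@K for retrieval.

    similarity: (num_queries, num_candidates) similarity scores
    gt_index:   (num_queries,) index of the correct candidate for each query
    Returns the fraction of queries whose correct candidate ranks in the top K.
    """
    topk = np.argsort(-similarity, axis=1)[:, :k]       # candidate indices, best first
    hits = (topk == gt_index[:, None]).any(axis=1)      # did the ground truth make the top K?
    return float(hits.mean())

# toy usage: 5 queries against 100 candidates with random scores
rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 100))
gt = rng.integers(0, 100, size=5)
print(recall_at_k(sim, gt, k=10))
```

The BLEU-1 value in the captioning row is a different metric: unigram precision of the generated captions against the reference captions, with a brevity penalty.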

Methods used in the Paper


No methods found.