Dual-CNN: A convolutional language decoder for paragraph image captioning

Abstract The task of paragraph image captioning aims to generate a coherent paragraph describing a given image. However, due to their limited ability to capture long-term dependencies, decoders based on recurrent neural networks (RNNs) or long short-term memory (LSTM) networks can hardly generate satisfactory long textual descriptions. In addition, such sequential decoders are inefficient to train because they cannot be parallelized across time steps. Motivated by the advantages of convolutional neural networks (CNNs), in this paper we propose a Dual-CNN decoder with long-term memory ability and parallel computation, which can produce a semantically coherent paragraph for an image. Our Dual-CNN model is evaluated on the Stanford image-paragraph dataset. Extensive experiments demonstrate that our Dual-CNN achieves results comparable to state-of-the-art models. Furthermore, we analyze the diversity and coherence of the generated paragraphs to show the superiority of our approach.

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Image Paragraph Captioning | Image Paragraph Captioning | Dual-CNN | BLEU-4 | 8.6 | #8 |
| Image Paragraph Captioning | Image Paragraph Captioning | Dual-CNN | METEOR | 15.8 | #8 |
| Image Paragraph Captioning | Image Paragraph Captioning | Dual-CNN | CIDEr | 17.4 | #8 |
