When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions

INLG (ACL) 2020  ·  Nikolai Ilinykh, Simon Dobnik

Generating multi-sentence image descriptions is a challenging task which requires a model to produce coherent and accurate paragraphs that describe the salient objects in an image. We argue that multiple sources of information are beneficial when describing visual scenes with long sequences. These include (i) perceptual information and (ii) semantic (language) information about how to describe what is in the image. We also compare the effects of using two different pooling mechanisms on either a single modality or their combination. We demonstrate that a model which utilises both visual and language inputs can generate accurate and diverse paragraphs when combined with an appropriate pooling mechanism. The results of our automatic and human evaluation show that learning to embed semantic information along with visual stimuli into the paragraph generation model is not trivial, raising a variety of proposals for future experiments.
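As a rough illustration only (not the authors' implementation), fusing pooled visual features with pooled language features into a single context vector for a paragraph decoder might look like the sketch below. The module names, feature dimensions, and the mean/max pooling choices are assumptions made for the example.

```python
# Minimal sketch (PyTorch), NOT the paper's code: pool and fuse visual and
# language features into one context vector. Dimensions are illustrative.
import torch
import torch.nn as nn


class FusedContext(nn.Module):
    def __init__(self, img_dim=2048, lng_dim=300, hidden_dim=512, pooling="mean"):
        super().__init__()
        self.pooling = pooling                      # "mean" or "max"
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.lng_proj = nn.Linear(lng_dim, hidden_dim)

    def pool(self, feats):
        # feats: (batch, num_items, dim) -> (batch, dim)
        if self.pooling == "max":
            return feats.max(dim=1).values
        return feats.mean(dim=1)

    def forward(self, img_feats, lng_feats):
        # img_feats: (batch, num_regions, img_dim), e.g. detector region features
        # lng_feats: (batch, num_tokens, lng_dim), e.g. word embeddings of a description
        img = self.img_proj(self.pool(img_feats))
        lng = self.lng_proj(self.pool(lng_feats))
        # Concatenate the two modalities into one context vector for the decoder
        return torch.cat([img, lng], dim=-1)        # (batch, 2 * hidden_dim)


if __name__ == "__main__":
    model = FusedContext(pooling="max")
    img = torch.randn(2, 36, 2048)   # 36 detected regions per image
    lng = torch.randn(2, 20, 300)    # 20 word embeddings
    print(model(img, lng).shape)     # torch.Size([2, 1024])
```

The same module could be instantiated with `pooling="mean"` to compare the two pooling mechanisms under otherwise identical settings.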

Task:    Image Paragraph Captioning
Dataset: Image Paragraph Captioning
Model:   IMG+LNG

Metric    Value    Global Rank
BLEU-4     4.67    #10
METEOR    11.30    #10
CIDEr     26.38    #3
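For reference, scores like the BLEU-4 value above are corpus-level n-gram overlap measures. The snippet below is a toy illustration of computing BLEU-4 with NLTK (an assumed tooling choice; METEOR and CIDEr are typically computed with the pycocoevalcap toolkit, and the example texts here are made up, not from the benchmark).

```python
# Illustrative only: corpus-level BLEU-4 on toy data with NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "man", "is", "standing", "on", "a", "beach", "near", "the", "water"]],
]
hypotheses = [
    ["a", "man", "stands", "on", "the", "beach", "by", "the", "water"],
]

smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1- to 4-gram weights = BLEU-4
    smoothing_function=smooth,
)
print(f"BLEU-4: {bleu4:.4f}")
```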

Methods


No methods listed for this paper.