AOG-LSTM: An adaptive attention neural network for visual storytelling

Visual storytelling is the task of generating a relevant story for a given image sequence, and it has received significant attention. However, using general RNNs (such as LSTM and GRU) as the decoder limits the performance of models on this task, because they cannot differentiate between different types of information representations. In addition, optimizing the probability of each subsequent word conditioned on the previous ground-truth sequence causes error accumulation during inference. Moreover, the existing method for alleviating error accumulation, based on replacing reference words, does not take into account the different effects of individual words. To address these problems, we propose a modified neural network named AOG-LSTM and a modified training strategy named ARS. AOG-LSTM adaptively pays appropriate attention to different information representations within it when predicting different words. During training, ARS replaces some words in the reference sentences with model predictions, as in the existing method, but uses a selection network and a selection strategy to choose more appropriate words to replace, which improves the model further. Experiments on the VIST dataset demonstrate that our model outperforms several strong baselines on the most commonly used metrics.
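The abstract outlines two components: an LSTM variant that adaptively weighs different information representations at each decoding step, and a training strategy (ARS) that replaces some reference words with model predictions chosen by a selection network. The PyTorch-style sketch below illustrates both ideas under our own assumptions; it is not the authors' implementation, and names such as AdaptiveGate and replace_with_predictions are hypothetical.

```python
# Minimal sketch (not the paper's code) of adaptive attention over information
# representations and ARS-style word replacement. All module/function names
# and the exact gating form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGate(nn.Module):
    """Weighs several information representations (e.g. image features,
    sentence context, previous-word embedding) before each LSTM step."""
    def __init__(self, dims, hidden_size):
        super().__init__()
        # project each representation into a common space
        self.proj = nn.ModuleList([nn.Linear(d, hidden_size) for d in dims])
        self.score = nn.Linear(hidden_size * 2, 1)

    def forward(self, reps, h_prev):
        # reps: list of tensors, each (batch, dim_i); h_prev: (batch, hidden)
        projected = [p(r) for p, r in zip(self.proj, reps)]
        stacked = torch.stack(projected, dim=1)          # (batch, n_reps, hidden)
        h_exp = h_prev.unsqueeze(1).expand_as(stacked)
        alpha = F.softmax(self.score(torch.cat([stacked, h_exp], dim=-1)), dim=1)
        return (alpha * stacked).sum(dim=1)              # fused input for the LSTM cell

def replace_with_predictions(targets, logits, scores, ratio=0.25):
    """ARS-style replacement sketch: swap a fraction of reference words with
    greedy model predictions, preferring the positions that a selection
    network's scores rank as most suitable for replacement (assumption)."""
    batch, length = targets.shape
    k = max(1, int(length * ratio))
    _, positions = scores.topk(k, dim=1)                 # chosen positions per sentence
    predictions = logits.argmax(dim=-1)                  # greedy model predictions
    mixed = targets.clone()
    mixed.scatter_(1, positions, predictions.gather(1, positions))
    return mixed
```

In this sketch the fused representation produced by AdaptiveGate would be fed to the LSTM cell at each step, and the mixed target sequence from replace_with_predictions would be used as decoder input during training instead of the pure ground-truth sequence.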


Datasets

VIST
Results from the Paper


Task: Visual Storytelling   Dataset: VIST   Model: AOG + ARS

Metric     Value   Global Rank
BLEU-1     69      # 1
BLEU-2     44      # 1
BLEU-3     23.9    # 4
BLEU-4     12.9    # 20
METEOR     36.0    # 6
CIDEr      12.0    # 4
ROUGE-L    30.1    # 11
