Summarizing Videos with Attention

In this work we propose a novel method for supervised, keyshot-based video summarization using a conceptually simple and computationally efficient soft self-attention mechanism. Current state-of-the-art methods leverage bidirectional recurrent networks such as BiLSTMs combined with attention. These networks are complex to implement and computationally demanding compared to fully connected networks. To that end, we propose a simple self-attention-based network for video summarization that performs the entire sequence-to-sequence transformation in a single feed-forward pass, and a single backward pass during training. Our method sets new state-of-the-art results on TvSum and SumMe, two benchmarks commonly used in this domain.
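The core idea — scoring every frame in one feed-forward pass via soft self-attention — can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the projection matrices `W_q`, `W_k`, `W_v` and the output vector `w_out` are stand-ins for learned parameters, and frame features would in practice come from a pretrained CNN.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_scores(features, W_q, W_k, W_v, w_out):
    """Soft self-attention over all N frames at once.

    features: (N, D) frame feature vectors.
    Returns an (N,) vector of importance scores in (0, 1).
    """
    Q = features @ W_q          # queries, (N, D)
    K = features @ W_k          # keys,    (N, D)
    V = features @ W_v          # values,  (N, D)
    d = Q.shape[-1]
    # Each frame attends to every other frame in a single matrix product.
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (N, N) attention weights
    context = attn @ V                               # (N, D) attended features
    # Sigmoid regression head maps each context vector to an importance score.
    return 1.0 / (1.0 + np.exp(-(context @ w_out)))

# Toy usage with random "frame features" (stand-ins for CNN features).
rng = np.random.default_rng(0)
N, D = 10, 16
feats = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_out = rng.standard_normal(D) * 0.1
scores = self_attention_scores(feats, Wq, Wk, Wv, w_out)
```

Because the whole (N, N) attention matrix is computed by dense matrix products, there is no sequential recurrence as in a BiLSTM; the per-frame scores would then drive keyshot selection.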


Results from the Paper


Ranked #3 on Video Summarization on TvSum (using extra training data)

Task                  Dataset  Model   Metric Name            Metric Value  Global Rank
Video Summarization   SumMe    VASNet  F1-score (Canonical)   49.71         #4
Video Summarization   SumMe    VASNet  F1-score (Augmented)   51.09         #3
Video Summarization   TvSum    VASNet  F1-score (Canonical)   61.42         #3
Video Summarization   TvSum    VASNet  F1-score (Augmented)   62.37         #3

Methods