Video Summarization
68 papers with code • 5 benchmarks • 13 datasets
Video Summarization aims to generate a short synopsis that summarizes the video content by selecting its most informative and important parts. The produced summary is usually composed of a set of representative video frames (a.k.a. video key-frames), or of video fragments (a.k.a. video key-fragments) stitched together in chronological order to form a shorter video. The former type of summary is known as a video storyboard, and the latter is known as a video skim.
Source: Video Summarization Using Deep Neural Networks: A Survey
Image credit: iJRASET
Latest papers
Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score
We show that the reconstruction loss of the model for a video with masked frames correlates with the representativeness of the remaining frames in the video.
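The underlying intuition can be illustrated with a toy proxy: drop (mask) some frames, reconstruct them from the frames that remain, and treat low reconstruction error as evidence that the kept frames are representative. The sketch below uses nearest-neighbour feature reconstruction purely for illustration; the paper itself uses a learned self-supervised model, and the function name and scoring scheme here are assumptions.

```python
import numpy as np

def restoration_score(features, keep_idx):
    """Toy representativeness proxy: reconstruct each masked (dropped)
    frame from the kept frames via nearest-neighbour features and return
    the negative mean reconstruction error, so a higher score means the
    kept set represents the video better. Illustrative only."""
    feats = np.asarray(features, dtype=float)
    keep = feats[sorted(keep_idx)]
    masked = [i for i in range(len(feats)) if i not in set(keep_idx)]
    if not masked:
        return 0.0
    # Error of each masked frame = distance to its closest kept frame.
    errs = [np.linalg.norm(feats[i] - keep, axis=1).min() for i in masked]
    return -float(np.mean(errs))
```

With two visually distinct scenes, keeping one frame from each scene scores higher than keeping two frames from the same scene, which is the correlation the paper exploits.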
UniVTG: Towards Unified Video-Language Temporal Grounding
Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels.
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks.
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the MMSum dataset.
Joint Moment Retrieval and Highlight Detection Via Natural Language Queries
Video summarization has become an increasingly important task in the field of computer vision due to the vast amount of video content available on the internet.
Hierarchical Video-Moment Retrieval and Step-Captioning
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
SELF-VS: Self-supervised Encoding Learning For Video Summarization
Empirical evaluations on correlation-based metrics, such as Kendall's $\tau$ and Spearman's $\rho$, demonstrate the superiority of our approach compared to existing state-of-the-art methods in assigning relative scores to the input frames.
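These rank-correlation metrics compare predicted frame-importance scores against human-annotated ones, rewarding methods that order frames correctly rather than matching scores exactly. A minimal example using SciPy (assuming `scipy` is available; the score values are made up for illustration):

```python
from scipy.stats import kendalltau, spearmanr

# Predicted frame-importance scores vs. a human-annotated reference.
# (Toy numbers; real evaluations average over annotators and videos.)
pred = [0.9, 0.1, 0.4, 0.8, 0.2]
ref = [1.0, 0.0, 0.5, 0.7, 0.3]

tau, _ = kendalltau(pred, ref)   # agreement over frame pairs
rho, _ = spearmanr(pred, ref)    # Pearson correlation of the ranks
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```

Here both metrics equal 1.0 because the two score lists induce the same ranking of frames, even though the raw values differ.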
VideoXum: Cross-modal Visual and Textural Summarization of Videos
We propose a new joint video and text summarization task.
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
VideoSum: A Python Library for Surgical Video Summarization
It is thus unsurprising that substantial research efforts are made to develop methods aiming at mitigating the scarcity of annotated surgical data science (SDS) data.