The SumMe dataset is a video summarization dataset consisting of 25 videos, each annotated with at least 15 human summaries (390 in total).
124 PAPERS • 3 BENCHMARKS
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings. In total, it comprises around 1 million images and video frames.
10 PAPERS • 2 BENCHMARKS
VideoXum is an enriched large-scale dataset for cross-modal video summarization, built on ActivityNet Captions. The dataset includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).
3 PAPERS • NO BENCHMARKS YET