video narration captioning

1 paper with code • 1 benchmark • 1 dataset

Human narration is another critical factor in understanding a multi-shot video. It often provides background knowledge and the commentator's view on the visual events. We conduct experiments to predict the narration caption of a video shot and name this task single-shot narration captioning. We adopt the same model structure as single-shot video captioning, with the ASR text as an additional input; the only difference is that the prediction target is the narration caption.
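The setup described above amounts to conditioning a caption model on both visual features and the ASR transcript. A minimal sketch of that input construction is below; the function name, special-token markers, and whitespace tokenization are illustrative assumptions, not details from the Shot2Story codebase.

```python
def build_narration_inputs(visual_tokens, asr_text):
    """Assemble the conditioning sequence for single-shot narration captioning.

    visual_tokens: list of per-frame visual feature tokens (already encoded).
    asr_text: ASR transcript of the shot, used as additional input.
    Returns a single token sequence; the model would be trained to generate
    the narration caption after the <cap> marker.
    """
    asr_tokens = asr_text.lower().split()  # toy whitespace tokenizer
    return ["<vid>"] + visual_tokens + ["<asr>"] + asr_tokens + ["<cap>"]


# Example: two visual tokens plus a short ASR transcript.
seq = build_narration_inputs(["v1", "v2"], "The chef stirs the sauce")
```

In practice the visual tokens would come from a frozen video encoder and the markers would be special tokens in the captioner's vocabulary; the point is only that ASR text is appended as extra conditioning rather than changing the model architecture.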

Most implemented papers

Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

bytedance/Shot2Story 16 Dec 2023

A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.