Dense Video Captioning
25 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting events in a video, that is, localizing each one in time, and describing each detected event in natural language.
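To make the task output concrete, here is a minimal sketch of how dense captioning results could be represented: a list of events, each with a time span and a caption. The `Event` class, the timestamps, and the caption strings are illustrative assumptions and are not taken from any particular paper or dataset. The `temporal_iou` helper shows one common way to compare a predicted span with a reference span, namely temporal intersection-over-union.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One detected event: a time span in seconds plus its caption (hypothetical type)."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time spans; 0.0 when they do not overlap."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up predictions for the piano video described above.
# Events may overlap in time, which is what makes the captioning "dense".
predictions = [
    Event(0.0, 30.0, "a man plays a piano"),
    Event(10.0, 20.0, "another man dances"),
    Event(25.0, 30.0, "a crowd claps"),
]

# Made-up reference annotation for the first event.
reference = Event(0.0, 20.0, "a man is playing the piano")

print(temporal_iou(predictions[0], reference))
```

Caption quality is judged separately from localization, typically with text-similarity metrics computed on predicted and reference events whose spans overlap sufficiently.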
Latest papers with no code
Exploiting Auxiliary Caption for Video Grounding
Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video.
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
We refer to this task as Procedure Segmentation and Summarization (PSS).
Recipe Generation from Unsegmented Cooking Videos
However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them.
SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions
Further, we validate our approach against the existing state-of-the-art algorithms for the Dense Video Captioning task on the ActivityNet Captions dataset.
PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning
The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video.
End-to-end Dense Video Captioning as Sequence Generation
Dense video captioning aims to identify the events of interest in an input video, and generate descriptive captions for each event.
Semantic-Aware Pretraining for Dense Video Captioning
This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021.
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Dense video captioning (DVC) aims to generate multi-sentence descriptions to elucidate the multiple events in the video, which is challenging and demands visual consistency, discoursal coherence, and linguistic diversity.
Sketch, Ground, and Refine: Top-Down Dense Video Captioning
The dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling.