Dense Video Captioning

25 papers with code • 4 benchmarks • 7 datasets

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
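As a concrete illustration of the task's output (the data here is hypothetical, not from any benchmark), a dense video captioning system maps one video to a set of temporally localized captions rather than a single sentence:

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    """One detected event: a time span plus a description."""
    start_s: float  # event start, in seconds
    end_s: float    # event end, in seconds
    caption: str

# Hypothetical output for the "man playing a piano" example above.
predictions = [
    DenseCaption(0.0, 45.2, "a man plays a piano"),
    DenseCaption(12.5, 30.0, "another man dances beside the piano"),
    DenseCaption(38.0, 45.2, "a crowd claps"),
]

# Unlike classic single-sentence video captioning, events may overlap in time.
overlapping = [p for p in predictions if p.start_s < 20.0 < p.end_s]
```

The two sub-problems most papers below address are visible in this structure: detecting the spans (`start_s`, `end_s`) and generating each `caption`.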

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

qiqang/uedvc 18 Jul 2022

Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video, and can be divided into two sub-tasks: event detection and event captioning.

Dense Video Captioning Using Unsupervised Semantic Information

valterlej/dvcusi 15 Dec 2021

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., minutes long) can be decomposed into simpler events (e.g., a few seconds long), and that these simple events are shared across several complex events.
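A toy sketch of that premise (the features and nearest-centroid assignment below are made up for illustration, not the paper's actual pipeline): short clips cut from different long videos cluster around a small set of shared "simple event" prototypes.

```python
import random

random.seed(0)

def toy_clip_features(n, center):
    """Fake 2-D features for n short clips; a visual encoder would supply these."""
    return [(center[0] + random.gauss(0, 0.1),
             center[1] + random.gauss(0, 0.1)) for _ in range(n)]

# Ten few-second clips drawn from two underlying simple events.
clips = toy_clip_features(5, (0.0, 0.0)) + toy_clip_features(5, (3.0, 3.0))

# Two shared prototypes; assignment is by nearest centroid.
centroids = [(0.0, 0.0), (3.0, 3.0)]

def assign(clip):
    dists = [(clip[0] - cx) ** 2 + (clip[1] - cy) ** 2 for cx, cy in centroids]
    return dists.index(min(dists))

labels = [assign(c) for c in clips]
# A long, complex video can then be summarized as a sequence of these
# shared simple-event labels.
```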

End-to-End Dense Video Captioning with Parallel Decoding

ttengwang/pdvc ICCV 2021

Dense video captioning aims to generate multiple captions, together with their temporal locations, from a video.

17 Aug 2021

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

HumamAlwassel/TSP 23 Nov 2020

Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.

Multimodal Pretraining for Dense Video Captioning

google-research-datasets/Video-Timeline-Tags-ViTT Asian Chapter of the Association for Computational Linguistics 2020

First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.
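An illustrative record in the spirit of a time-stamped instructional-video annotation (field names are hypothetical, not the actual ViTT schema):

```python
# One annotated video: short free-text tags anchored to timestamps.
video_annotation = {
    "video_id": "example_0001",
    "timeline": [
        {"time_s": 0.0,  "tag": "intro"},
        {"time_s": 14.0, "tag": "knead the dough"},
        {"time_s": 73.5, "tag": "shape the loaf"},
    ],
}

# Tags are sorted by time, so each tag implicitly spans until the next one.
pairs = list(zip(video_annotation["timeline"],
                 video_annotation["timeline"][1:]))
segment_lengths = [b["time_s"] - a["time_s"] for a, b in pairs]
```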

10 Nov 2020

Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020

ttengwang/dense-video-captioning-pytorch 21 Jun 2020

This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

v-iashin/video_features 17 May 2020

We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.

Multi-modal Dense Video Captioning

v-iashin/video_features 17 Mar 2020

We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
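A minimal sketch of that input layout (field names and dimensions are illustrative assumptions, not the authors' API): the ASR transcript is kept as its own time-aligned stream, parallel to the visual and audio feature streams.

```python
# Hypothetical multi-modal sample for one video.
sample = {
    "frames": [[0.1] * 4 for _ in range(8)],  # 8 visual feature vectors
    "audio":  [[0.2] * 4 for _ in range(8)],  # 8 audio feature vectors
    "speech": [                               # ASR output, like subtitles
        {"start_s": 0.0, "end_s": 2.1, "text": "today we bake bread"},
        {"start_s": 2.1, "end_s": 4.0, "text": "first, mix the flour"},
    ],
}

def speech_at(sample, t):
    """Return the transcript segment temporally aligned with time t, if any."""
    for seg in sample["speech"]:
        if seg["start_s"] <= t < seg["end_s"]:
            return seg["text"]
    return ""
```

Keeping the transcript time-aligned is what lets a captioning model attend to the words spoken during a candidate event span, rather than to the whole transcript at once.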

Streamlined Dense Video Captioning

ttengwang/ESGN CVPR 2019

Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events.

08 Apr 2019