Dense Video Captioning
24 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
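Concretely, a dense-captioning prediction can be thought of as a list of timestamped events, each paired with a sentence; predicted segments are typically matched to ground truth by temporal IoU. The sketch below illustrates this (the event list and the 0.5-threshold convention are assumptions for illustration, not from any specific paper above):

```python
# Hypothetical sketch: a dense-video-captioning output is a list of
# (start_sec, end_sec, caption) events. Temporal IoU is a common way to
# match a predicted segment against a ground-truth segment.
def temporal_iou(pred, gt):
    """Intersection-over-union of two [start, end] segments (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A toy prediction for the piano video described above.
events = [
    (0.0, 45.0, "a man plays the piano"),
    (12.0, 30.0, "another man dances"),
    (40.0, 45.0, "a crowd claps"),
]

# Match the second predicted event to an assumed ground-truth span;
# a prediction counts as correct if IoU exceeds a threshold (often 0.5).
gt_dancing = (10.0, 32.0)
iou = temporal_iou(events[1][:2], gt_dancing)  # 18 / 22 ≈ 0.818
```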
Latest papers with no code
The 8th AI City Challenge
The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities.
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis
Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense captions, organized into chapters, for each video.
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of the event captions and their associated pseudo event boundaries generated from unlabeled videos.
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view.
Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
Dense Video Captioning (DVC) aims at detecting and describing different events in a given video.
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges
Furthermore, we benchmark SOTA models for four multimodal tasks on this newly created dataset, which serve as new baselines for surveillance video-and-language understanding.
VidChapters-7M: Video Chapters at Scale
To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment
This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model.
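One plausible reading of a "soft moment mask" is a differentiable weighting over frame timestamps, parameterized by a segment center and width so that both can be optimized by gradient descent. The sketch below is an assumed formulation for illustration (the sigmoid-edge parameterization, the names `c`, `w`, and the temperature `tau` are not taken from the paper):

```python
import numpy as np

# Hedged sketch: a soft mask over normalized frame times in [0, 1],
# parameterized by center `c` and width `w`. `tau` controls how sharply
# the mask falls off at the segment edges; smaller tau -> harder mask.
def soft_moment_mask(times, c, w, tau=0.05):
    left = 1.0 / (1.0 + np.exp(-(times - (c - w / 2)) / tau))
    right = 1.0 / (1.0 + np.exp(-((c + w / 2) - times) / tau))
    return left * right  # ~1 inside [c - w/2, c + w/2], ~0 outside

times = np.linspace(0.0, 1.0, 11)          # normalized frame timestamps
mask = soft_moment_mask(times, c=0.5, w=0.4)
# Frames near t = 0.5 receive weight near 1; frames at the ends near 0.
```

Because the mask is smooth in `c` and `w`, its parameters can be updated jointly with the language-model prefix parameters, which is the kind of joint optimization the abstract describes.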
Visual Transformation Telling
In this paper, we propose a new visual reasoning task, called Visual Transformation Telling (VTT).
A Review of Deep Learning for Video Captioning
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction.