Video Captioning
160 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which in turn enables efficient text-based retrieval of videos.
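As a rough illustration of the task interface (not tied to any particular paper listed below), a minimal sketch might uniformly sample frames from a video and hand them to a pretrained captioning model. The frame sampling uses OpenCV; `caption_model` is a hypothetical placeholder for whichever pretrained video-captioning model you load.

```python
# Minimal video-captioning sketch: sample frames, then caption them.
# `caption_model` is a hypothetical stand-in for a real pretrained model.
import cv2


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def caption_video(video_path: str, caption_model) -> str:
    """Produce a single sentence describing the sampled frames."""
    frames = sample_frames(video_path)
    return caption_model(frames)  # e.g. "a man is playing guitar on stage"
```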
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
Libraries
Use these libraries to find Video Captioning models and implementations.
Latest papers
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
There has been significant attention to research on dense video captioning, which aims to automatically localize and caption all events within an untrimmed video.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.
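One plausible way to keep the visual context bounded for long videos is a fixed-capacity memory bank that merges its most similar adjacent frame embeddings when full. The sketch below shows that idea only; it is an illustrative assumption, not MA-LMM's exact algorithm.

```python
# Hypothetical bounded visual memory bank for long videos: when the bank
# exceeds `capacity`, merge the two most similar adjacent frame embeddings.
import numpy as np


class FrameMemoryBank:
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.embeddings: list[np.ndarray] = []

    def add(self, emb: np.ndarray) -> None:
        self.embeddings.append(emb)
        if len(self.embeddings) > self.capacity:
            self._merge_most_similar_pair()

    def _merge_most_similar_pair(self) -> None:
        # Cosine similarity between each pair of adjacent embeddings.
        sims = [
            float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(self.embeddings[:-1], self.embeddings[1:])
        ]
        i = int(np.argmax(sims))
        merged = (self.embeddings[i] + self.embeddings[i + 1]) / 2.0
        self.embeddings[i:i + 2] = [merged]
```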
Streaming Dense Video Captioning
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.
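To make the streaming requirement concrete, a sketch of such a loop might consume the video in fixed-length clips and emit temporally localized `(start, end, caption)` tuples as soon as an event is detected, before the whole video has been read. The `clip_captioner` callable and the clip-length handling are assumptions for illustration, not the paper's model.

```python
# Hypothetical streaming dense-captioning loop: process the video clip by
# clip and yield (start_s, end_s, caption) tuples immediately, without
# waiting for the full video.
from typing import Iterator, Optional, Tuple


def stream_dense_captions(
    clips: Iterator,            # yields consecutive fixed-length clips
    clip_captioner,             # hypothetical model: clip -> Optional[str]
    clip_len_s: float = 2.0,
) -> Iterator[Tuple[float, float, str]]:
    start = 0.0
    for i, clip in enumerate(clips):
        caption: Optional[str] = clip_captioner(clip)
        end = (i + 1) * clip_len_s
        if caption is not None:
            yield (start, end, caption)   # emit immediately (streaming)
            start = end                   # the next event begins afterwards
```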
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
LVCHAT: Facilitating Long Video Comprehension
To address this issue, we propose Long Video Chat (LVChat), where Frame-Scalable Encoding (FSE) is introduced to dynamically adjust the number of embeddings in alignment with the duration of the video to ensure long videos are not overly compressed into a few embeddings.
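The core idea of scaling the number of video embeddings with duration can be sketched as follows: pick an embedding count proportional to video length (with a cap) and pool per-frame features into that many groups. The constants and mean pooling here are illustrative assumptions, not LVChat's exact FSE.

```python
# Hypothetical duration-aware frame encoding: the number of output
# embeddings grows with video length (capped), and per-frame features are
# mean-pooled into that many groups.
import numpy as np


def frame_scalable_encode(
    frame_feats: np.ndarray,      # (num_frames, dim) per-frame features
    duration_s: float,
    embeddings_per_minute: int = 16,
    max_embeddings: int = 256,
) -> np.ndarray:
    num_emb = min(max_embeddings,
                  max(1, int(duration_s / 60.0 * embeddings_per_minute)))
    num_emb = min(num_emb, frame_feats.shape[0])   # avoid empty groups
    groups = np.array_split(frame_feats, num_emb, axis=0)
    return np.stack([g.mean(axis=0) for g in groups], axis=0)  # (num_emb, dim)
```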
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists.
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach.
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.