Video Captioning

154 papers with code • 11 benchmarks • 31 datasets

Video Captioning is the task of automatically captioning a video by understanding the actions and events in it, which can also enable efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Latest papers with no code

Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

no code yet • 8 Mar 2024

Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

no code yet • 29 Feb 2024

Next, we finetune a retrieval model on a small subset in which the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation.
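The selection step described above — scoring each candidate caption against a video and keeping the top one — can be sketched as follows. This is a toy illustration only: the scoring function here is a trivial keyword-overlap stand-in for the paper's finetuned video-text retrieval model, and all names are hypothetical.

```python
def select_best_caption(video_feature, candidate_captions, score_fn):
    # Keep the candidate caption whose score against the video is highest.
    # score_fn stands in for a finetuned video-text retrieval model.
    return max(candidate_captions, key=lambda c: score_fn(video_feature, c))

# Toy example: the "video feature" is a keyword set; score = word overlap.
def overlap_score(keywords, caption):
    return len(keywords & set(caption.lower().split()))

video = {"dog", "running", "beach"}
captions = ["a cat sleeping indoors",
            "a dog running on the beach",
            "people walking in a park"]
best = select_best_caption(video, captions, overlap_score)
# best == "a dog running on the beach"
```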

MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

no code yet • 27 Feb 2024

Further, to better constrain the knowledge characteristics of old and new tasks at the feature level, we design Two-stage Knowledge Distillation (TsKD), which learns the new task well while balancing retention of the old tasks.
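The general mechanism behind such distillation-based forgetting mitigation can be illustrated with a standard Hinton-style knowledge-distillation loss: a cross-entropy term for the new task plus a temperature-softened KL term that keeps the student close to the old model's outputs. This is a generic sketch of that idea, not TsKD itself, which operates in two stages at the feature level.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    # KL divergence between temperature-softened teacher and student
    # distributions constrains the student toward the old model's behavior.
    p_teacher = softmax([z / T for z in teacher_logits])
    p_student = softmax([z / T for z in student_logits])
    kd = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Ordinary cross-entropy on the new-task label.
    ce = -math.log(softmax(student_logits)[target])
    # alpha trades off old-task retention against new-task learning.
    return alpha * (T * T) * kd + (1 - alpha) * ce
```

When the student matches the teacher exactly, the KL term vanishes and only the new-task cross-entropy remains.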

Video ReCap: Recursive Captioning of Hour-Long Videos

no code yet • 20 Feb 2024

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
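The clip → segment → video hierarchy described above can be sketched as a simple bottom-up pipeline. The captioners below are toy string-producing stand-ins (the paper uses learned models trained with a curriculum); all function names are hypothetical.

```python
def caption_clips(clips):
    # Level 1: one short caption per clip, describing an atomic action.
    # Toy stand-in captioner: just label each clip by index.
    return [f"action in clip {i}" for i, _ in enumerate(clips)]

def caption_segments(clip_captions, clips_per_segment=3):
    # Level 2: summarize consecutive clip captions into segment captions.
    segments = []
    for i in range(0, len(clip_captions), clips_per_segment):
        chunk = clip_captions[i:i + clips_per_segment]
        segments.append("segment summary of: " + "; ".join(chunk))
    return segments

def caption_video(segment_captions):
    # Level 3: a single video-level summary built from segment captions.
    return "video summary of %d segments" % len(segment_captions)

clips = list(range(7))          # pretend these are 7 short video clips
clip_caps = caption_clips(clips)
seg_caps = caption_segments(clip_caps)
summary = caption_video(seg_caps)
```

Each level consumes only the captions produced by the level below it, which is what makes the scheme recursive and lets it scale to hour-long videos.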

Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark

no code yet • 25 Jan 2024

We develop a knowledge-guided entity-aware video captioning network (KEANet), built in encoder-decoder form around a candidate player list, for basketball live text broadcasts.

SnapCap: Efficient Snapshot Compressive Video Captioning

no code yet • 10 Jan 2024

To address these problems, in this paper we propose a novel VC pipeline that generates captions directly from the compressed measurement, which can be captured by a snapshot compressive sensing camera; we dub our model SnapCap.
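For readers unfamiliar with snapshot compressive sensing, the "compressed measurement" is a single 2D frame formed by summing masked video frames: Y = Σ_t (M_t ⊙ X_t), elementwise per pixel. The sketch below shows only this standard forward model, not SnapCap's captioning pipeline.

```python
def snapshot_measurement(frames, masks):
    # Collapse T frames into one 2D measurement: Y = sum_t (M_t * X_t),
    # where each mask M_t modulates its frame X_t pixel by pixel.
    h, w = len(frames[0]), len(frames[0][0])
    Y = [[0.0] * w for _ in range(h)]
    for X, M in zip(frames, masks):
        for i in range(h):
            for j in range(w):
                Y[i][j] += M[i][j] * X[i][j]
    return Y

# Two 2x2 frames with complementary binary masks.
frames = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
masks = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
Y = snapshot_measurement(frames, masks)
# Y == [[1.0, 6.0], [7.0, 4.0]]
```

SnapCap's premise is that captions can be produced from Y directly, skipping the costly step of first reconstructing the T frames.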

Retrieval-Augmented Egocentric Video Captioning

no code yet • 1 Jan 2024

In this paper, we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos.

Set Prediction Guided by Semantic Concepts for Diverse Video Captioning

no code yet • 25 Dec 2023

Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction.

Subject-Oriented Video Captioning

no code yet • 20 Dec 2023

To address this problem, we propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.

Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)

no code yet • 12 Dec 2023

Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain.