Video Captioning
160 papers with code • 11 benchmarks • 32 datasets
Video Captioning is the task of automatically generating a natural-language caption for a video by understanding the actions and events it contains, which in turn enables efficient text-based retrieval of videos.
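As a rough illustration of the task interface (not tied to any particular paper listed below), a minimal sketch might uniformly sample frames from a video and hand them to a pretrained captioning model. The frame sampling uses OpenCV; `caption_model` is a hypothetical placeholder for whichever pretrained video-captioning model you load.

```python
# Minimal video-captioning sketch: sample frames, then caption them.
# `caption_model` is a hypothetical stand-in for a real pretrained model.
import cv2


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def caption_video(video_path: str, caption_model) -> str:
    """Produce a single sentence describing the sampled frames."""
    frames = sample_frames(video_path)
    return caption_model(frames)  # e.g. "a man is playing guitar on stage"
```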
Source: NITS-VC System for VATEX Video Captioning Challenge 2020
Libraries
Use these libraries to find Video Captioning models and implementations.
Latest papers
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
There has been significant attention to research on dense video captioning, which aims to automatically localize and caption all events within an untrimmed video.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.
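One plausible way to keep the visual context bounded for long videos is a fixed-capacity memory bank that merges its most similar adjacent frame embeddings when full. The sketch below shows that idea only; it is an illustrative assumption, not MA-LMM's exact algorithm.

```python
# Hypothetical bounded visual memory bank for long videos: when the bank
# exceeds `capacity`, merge the two most similar adjacent frame embeddings.
import numpy as np


class FrameMemoryBank:
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.embeddings: list[np.ndarray] = []

    def add(self, emb: np.ndarray) -> None:
        self.embeddings.append(emb)
        if len(self.embeddings) > self.capacity:
            self._merge_most_similar_pair()

    def _merge_most_similar_pair(self) -> None:
        # Cosine similarity between each pair of adjacent embeddings.
        sims = [
            float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(self.embeddings[:-1], self.embeddings[1:])
        ]
        i = int(np.argmax(sims))
        merged = (self.embeddings[i] + self.embeddings[i + 1]) / 2.0
        self.embeddings[i:i + 2] = [merged]
```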
Streaming Dense Video Captioning
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.
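To make the streaming requirement concrete, a sketch of such a loop might consume the video in fixed-length clips and emit temporally localized `(start, end, caption)` tuples as soon as an event is detected, before the whole video has been read. The `clip_captioner` callable and the clip-length handling are assumptions for illustration, not the paper's model.

```python
# Hypothetical streaming dense-captioning loop: process the video clip by
# clip and yield (start_s, end_s, caption) tuples immediately, without
# waiting for the full video.
from typing import Iterator, Optional, Tuple


def stream_dense_captions(
    clips: Iterator,            # yields consecutive fixed-length clips
    clip_captioner,             # hypothetical model: clip -> Optional[str]
    clip_len_s: float = 2.0,
) -> Iterator[Tuple[float, float, str]]:
    start = 0.0
    for i, clip in enumerate(clips):
        caption: Optional[str] = clip_captioner(clip)
        end = (i + 1) * clip_len_s
        if caption is not None:
            yield (start, end, caption)   # emit immediately (streaming)
            start = end                   # the next event begins afterwards
```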
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
LVCHAT: Facilitating Long Video Comprehension
To address this issue, we propose Long Video Chat (LVChat), where Frame-Scalable Encoding (FSE) is introduced to dynamically adjust the number of embeddings in alignment with the duration of the video to ensure long videos are not overly compressed into a few embeddings.
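The core idea of scaling the number of video embeddings with duration can be sketched as follows: pick an embedding count proportional to video length (with a cap) and pool per-frame features into that many groups. The constants and mean pooling here are illustrative assumptions, not LVChat's exact FSE.

```python
# Hypothetical duration-aware frame encoding: the number of output
# embeddings grows with video length (capped), and per-frame features are
# mean-pooled into that many groups.
import numpy as np


def frame_scalable_encode(
    frame_feats: np.ndarray,      # (num_frames, dim) per-frame features
    duration_s: float,
    embeddings_per_minute: int = 16,
    max_embeddings: int = 256,
) -> np.ndarray:
    num_emb = min(max_embeddings,
                  max(1, int(duration_s / 60.0 * embeddings_per_minute)))
    num_emb = min(num_emb, frame_feats.shape[0])   # avoid empty groups
    groups = np.array_split(frame_feats, num_emb, axis=0)
    return np.stack([g.mean(axis=0) for g in groups], axis=0)  # (num_emb, dim)
```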
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists.
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach.
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.