WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their content.
182 PAPERS • 1 BENCHMARK
SyMoN contains 5,193 video summaries of popular movies and TV series. It captures naturalistic storytelling videos made by human creators for human audiences, and has higher story coverage and more frequent mental-state references than similar video-language story datasets.
3 PAPERS • NO BENCHMARKS YET
Youku-mPLUG is a large-scale, high-quality Chinese video-language dataset collected from Youku.com, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality. It contains 10 million video-text pairs for pre-training and 0.3 million videos for downstream benchmarks covering Video-Text Retrieval, Video Captioning and Video Category Classification.
The goal of this dataset is to probe video-language models for understanding of simple temporal relations like "before" and "after". The dataset is only meant to be an evaluation set and not a training set.
2 PAPERS • 1 BENCHMARK
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
2 PAPERS • NO BENCHMARKS YET