5 dataset results for Video-Text Retrieval

WebVid

WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their content.

182 PAPERS • 1 BENCHMARK

SyMoN (Synopses of Movie Narratives)

SyMoN contains 5,193 video summaries of popular movies and TV series. It captures naturalistic storytelling videos made by human creators for a human audience, and has higher story coverage and more frequent mental-state references than similar video-language story datasets.

3 PAPERS • NO BENCHMARKS YET

Youku-mPLUG

Youku-mPLUG is a large-scale, high-quality Chinese video-language dataset collected from Youku.com, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality. It contains 10 million video-text pairs for pre-training and 0.3 million videos for downstream benchmarks covering Video-Text Retrieval, Video Captioning, and Video Category Classification.

3 PAPERS • NO BENCHMARKS YET

Test-of-Time (Test of Time Synthetic Video Dataset)

This dataset probes whether video-language models understand simple temporal relations such as "before" and "after". It is intended only as an evaluation set, not a training set.

2 PAPERS • 1 BENCHMARK

VTC (Videos, Titles and Comments)

VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.

2 PAPERS • NO BENCHMARKS YET
