HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

ICCV 2019

Datasets


Introduced in the Paper:

HowTo100M

Used in the Paper:

MSR-VTT, DiDeMo, YouCook2, LSMDC, CrossTask
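
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Temporal Action Localization | CrossTask | Text-Video Embedding | Recall | 33.6 | #4 |
| Video Retrieval | LSMDC | Text-Video Embedding | text-to-video R@1 | 7.2 | #36 |
| Video Retrieval | LSMDC | Text-Video Embedding | text-to-video R@5 | 19.6 | #31 |
| Video Retrieval | LSMDC | Text-Video Embedding | text-to-video R@10 | 27.9 | #30 |
| Video Retrieval | LSMDC | Text-Video Embedding | text-to-video Median Rank | 40 | #19 |
| Video Retrieval | MSR-VTT | Text-Video Embedding | text-to-video R@1 | 14.9 | #33 |
| Video Retrieval | MSR-VTT | Text-Video Embedding | text-to-video R@10 | 52.8 | #28 |
| Video Retrieval | MSR-VTT | Text-Video Embedding | text-to-video Median Rank | 9 | #12 |
| Video Retrieval | MSR-VTT | Text-Video Embedding | video-to-text R@5 | 40.2 | #10 |
| Video Retrieval | MSR-VTT-1kA | HT | text-to-video R@1 | 12.1 | #54 |
| Video Retrieval | MSR-VTT-1kA | HT | text-to-video R@5 | 35.0 | #53 |
| Video Retrieval | MSR-VTT-1kA | HT | text-to-video R@10 | 48.0 | #56 |
| Video Retrieval | MSR-VTT-1kA | HT | text-to-video Median Rank | 12 | #37 |
| Video Retrieval | MSR-VTT-1kA | HT-Pretrained | text-to-video R@1 | 14.9 | #53 |
| Video Retrieval | MSR-VTT-1kA | HT-Pretrained | text-to-video R@5 | 40.2 | #52 |
| Video Retrieval | MSR-VTT-1kA | HT-Pretrained | text-to-video R@10 | 52.8 | #55 |
| Video Retrieval | MSR-VTT-1kA | HT-Pretrained | text-to-video Median Rank | 9 | #36 |
| Video Retrieval | YouCook2 | Text-Video Embedding | text-to-video Median Rank | 24 | #7 |
| Video Retrieval | YouCook2 | Text-Video Embedding | text-to-video R@1 | 8.2 | #11 |
| Video Retrieval | YouCook2 | Text-Video Embedding | text-to-video R@10 | 35.3 | #13 |
| Video Retrieval | YouCook2 | Text-Video Embedding | text-to-video R@5 | 24.5 | #10 |
| Long Video Retrieval (Background Removed) | YouCook2 | Text-Video Embedding | Cap. Avg. R@1 | 46.6 | #5 |
| Long Video Retrieval (Background Removed) | YouCook2 | Text-Video Embedding | Cap. Avg. R@5 | 74.3 | #5 |
| Long Video Retrieval (Background Removed) | YouCook2 | Text-Video Embedding | Cap. Avg. R@10 | 83.7 | #4 |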
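
For readers unfamiliar with the retrieval metrics reported above, the following is a minimal sketch of how R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It is not the paper's evaluation code. It assumes that each text query has exactly one correct video (query i matches video i) and that ties are broken in favour of the correct video. The function names and the toy scores are hypothetical.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct video for each text query.

    similarity[i][j] is the score between query i and video j.
    Assumption: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of videos scoring strictly higher than the target,
        # so ties are resolved in favour of the correct video.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video (lower is better)."""
    return median(ranks)


# Toy 4x4 similarity matrix (made-up scores, for illustration only).
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.4, 0.2, 0.8, 0.6],
    [0.1, 0.7, 0.5, 0.3],
    [0.2, 0.3, 0.1, 0.6],
]
toy_ranks = retrieval_ranks(toy_similarity)
print(toy_ranks)                  # [1, 4, 2, 1]
print(recall_at_k(toy_ranks, 1))  # 50.0
print(median_rank(toy_ranks))     # 1.5
```

Higher R@K is better, while a lower Median Rank is better, which is why the two columns move in opposite directions in the table.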
