7 dataset results for Video Description

Flickr30k

The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators.

731 PAPERS • 9 BENCHMARKS

YouCook

The YouCook dataset was prepared from 88 open-source YouTube videos of people cooking various recipes. All videos are in the third-person viewpoint and pose a significantly more challenging visual problem than existing cooking and kitchen datasets: the background kitchen or scene differs across many videos, and most videos have dynamic camera changes. Frame-by-frame object and action annotations are provided for the training data, along with a number of precomputed low-level features. Each video also has several human-provided natural language descriptions (eight per video on average). The dataset was created to serve as a benchmark for describing complex real-world videos in natural language.

44 PAPERS • NO BENCHMARKS YET

ActivityNet Entities

ActivityNet-Entities augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This makes it possible to train video description models with this data and, importantly, to evaluate how grounded or "true" such models are to the video they describe.

15 PAPERS • NO BENCHMARKS YET

VideoCC3M (Video-Conceptual-Captions)

We propose a new, scalable video-mining pipeline which transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired videos and captions, using the Conceptual Captions 3M image dataset as a seed dataset. Our resulting dataset, VideoCC3M, consists of millions of weakly paired clips with text captions and will be released publicly.

9 PAPERS • NO BENCHMARKS YET

TACoS Multi-Level Corpus

Augments the video-description dataset TACoS with short and single-sentence descriptions.

6 PAPERS • NO BENCHMARKS YET

EDUB-Seg (Egocentric Dataset of the University of Barcelona – Segmentation)

Egocentric Dataset of the University of Barcelona – Segmentation (EDUB-Seg) is a dataset for egocentric event segmentation acquired with the Narrative Clip, a camera that takes a picture every 30 seconds. The dataset contains a total of 18,735 images captured by 7 different users over a total of 20 days. To ensure diversity, the users wore the camera in different contexts: while attending a conference, on holiday, during the weekend, and during the week.

4 PAPERS • NO BENCHMARKS YET

M-VAD Names (M-VAD Names Dataset)

The dataset contains annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and their associations with the characters' textual mentions, when available. The visual appearances of characters in each video clip of each movie were detected and annotated through a semi-automatic approach. The released dataset contains more than 24k annotated video clips, including 63k visual tracks and 34k textual mentions, all associated with their character identities.

3 PAPERS • 1 BENCHMARK
