Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. Violin is a dataset for this task, consisting of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and interactions between people, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.
17 PAPERS • NO BENCHMARKS YET
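Concretely, each Violin example pairs a (video, subtitles) premise with a hypothesis and a binary entailment label. The sketch below shows one plausible way to represent such an example in Python; the field names are illustrative assumptions, not the dataset's official schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ViolinExample:
    # Field names are assumptions for illustration, not the official release format.
    clip_id: str            # identifier of the source video clip
    video_path: str         # path to the video file (visual part of the premise)
    subtitles: List[str]    # subtitle lines aligned with the clip (textual premise)
    hypothesis: str         # natural language statement about the clip
    entailed: bool          # True if the hypothesis is entailed, False if contradicted


def predict(example: ViolinExample) -> bool:
    """A model for this task maps (video, subtitles, hypothesis) to {entailed, contradicted}."""
    raise NotImplementedError  # placeholder for an actual video-language model
```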
VTC is a large-scale multimodal dataset of roughly 300k video-caption pairs, each accompanied by user comments, intended for multimodal representation learning.
2 PAPERS • NO BENCHMARKS YET
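As a rough illustration of the data composition (video, caption, and accompanying comments), one might represent a VTC sample as follows; the structure is an assumption, not the dataset's released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VTCSample:
    # Hypothetical layout of one video-caption pair plus its comments.
    video_path: str
    caption: str
    comments: List[str]
```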
The IAW dataset contains 420 Ikea furniture pieces from 14 common categories (e.g., sofa, bed, wardrobe, table). Each piece of furniture comes with one or more user instruction manuals, which are first divided into pages and then further divided into independent steps cropped from each page (some pages contain more than one step, and some pages contain no instructions). There are 8,568 pages and 8,263 steps overall, on average 20.4 pages and 19.7 steps per piece of furniture. YouTube was crawled to find videos corresponding to these instruction manuals, so the recording conditions vary widely in duration, resolution, first- or third-person view, camera pose, background environment, and number of assemblers. The IAW dataset contains 1,005 raw videos with a total length of around 183 hours. Among them, approximately 114 hours of content are annotated with 15,649 actions, each matched to the corresponding step in the corresponding manual (see the sketch below).
1 PAPER • NO BENCHMARKS YET
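To make the hierarchy explicit (furniture piece → manual pages → steps, with labeled video segments pointing back to steps), here is a minimal sketch of how the annotations could be organized; all names are hypothetical and not taken from the dataset's release format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    step_id: str            # e.g. "wardrobe_03/page_05/step_02" (illustrative identifier)
    crop_image_path: str    # step region cropped from the manual page

@dataclass
class Page:
    page_image_path: str
    steps: List[Step] = field(default_factory=list)  # may be empty (page with no instructions)

@dataclass
class ActionSegment:
    video_id: str
    start_sec: float
    end_sec: float
    step_id: str            # the manual step this labeled video segment is matched to

@dataclass
class FurniturePiece:
    category: str           # one of the 14 categories, e.g. "sofa", "bed"
    pages: List[Page] = field(default_factory=list)
    actions: List[ActionSegment] = field(default_factory=list)
```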