866 dataset results for Videos

VOST consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex transformations, capturing their full temporal extent.

4 PAPERS • NO BENCHMARKS YET

VOT2019

VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB.

4 PAPERS • 1 BENCHMARK

VidOR

VidOR (Video Object Relation) dataset contains 10,000 videos (98.6 hours) from YFCC100M collection together with a large amount of fine-grained annotations for relation understanding. In particular, 80 categories of objects are annotated with bounding-box trajectory to indicate their spatio-temporal location in the videos; and 50 categories of relation predicates are annotated among all pairs of annotated objects with starting and ending frame index. This results in around 50,000 object and 380,000 relation instances annotated. To use the dataset for model development, the dataset is split into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing.

4 PAPERS • 2 BENCHMARKS
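
As a quick sanity check on the VidOR description, the three split sizes it quotes can be summed and compared with the stated total of 10,000 videos. This is only an illustrative sketch: the dictionary keys are hypothetical names chosen here, not identifiers from the dataset's own tooling.

```python
# Split sizes as quoted in the VidOR description above.
# The key names are illustrative labels, not official identifiers.
splits = {"train": 7_000, "val": 835, "test": 2_165}

# Total number of videos implied by the three splits.
total = sum(splits.values())

# Fraction of the dataset assigned to each split.
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} videos ({frac:.2%})")
```

The three counts add up to the 10,000 videos mentioned in the description, which corresponds to a roughly 70/8/22 split.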

WITS (Why Is This Sarcastic?)

This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue, that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:

4 PAPERS • 2 BENCHMARKS

YCBInEOAT Dataset

A new dataset with significant occlusions related to object manipulation.

4 PAPERS • NO BENCHMARKS YET

Zenseact Open Dataset

The Zenseact Open Dataset (ZOD) is a large-scale and diverse multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.

4 PAPERS • NO BENCHMARKS YET

ARMBench

ARMBench is a large-scale, object-centric benchmark dataset for robotic manipulation in the context of a warehouse. ARMBench contains images, videos, and metadata that correspond to 235K+ pick-and-place activities on 190K+ unique objects. The data is captured at different stages of manipulation, i.e., pre-pick, during transfer, and after placement.

3 PAPERS • NO BENCHMARKS YET

ActivityNet Adverbs

ActivityNet Adverbs is a subset from the ActivityNet dataset with extracted verb-adverb annotations. ActivityNet Adverbs contains 20 adverbs appearing across 114 actions, forming 643 unique action-adverb pairs in 3,099 video clips.

3 PAPERS • 2 BENCHMARKS

BUAA-MIHR dataset (Large-scale-Multi-illumination-HR-Database)

The BUAA-MIHR dataset is a remote photoplethysmography (rPPG) dataset for evaluating rPPG pipelines under multi-illumination conditions. We recruited 15 healthy subjects (12 male, 3 female, 18 to 30 years old) for this experiment, and a total of 165 video sequences were recorded under various illuminations. The experiments were conducted in a darkroom in order to isolate them from ambient light.

3 PAPERS • NO BENCHMARKS YET

CCv2 (Casual Conversations v2)

Casual Conversations v2 (CCv2) is composed of over 5,567 participants (26,467 videos) and intended mainly to be used for assessing the performance of already trained models in computer vision and audio applications for the purposes permitted in our data license agreement. The videos feature paid individuals who agreed to participate in the project and explicitly provided Age, Gender, Language/Dialect, Geo-location, Disability, Physical adornments, Physical attributes labels themselves. The videos were recorded in Brazil, India, Indonesia, Mexico, Philippines, United States, and Vietnam with a diverse set of adults in various categories. A group of trained annotators labeled the participants’ apparent skin tone using the Fitzpatrick scale and Monk Scale, in addition to annotations of Voice timbre, Activity and Recording setups. Spoken words in all videos are either scripted (a sample paragraph from The Idiot by Fyodor Dostoevsky provided with the dataset) or nonscripted (answering one o

3 PAPERS • NO BENCHMARKS YET

CHAD (Charlotte Anomaly Dataset)

CHAD is a high-resolution, multi-camera dataset for surveillance video anomaly detection. It includes bounding box, Re-ID, and pose annotations, as well as frame-level anomaly labels that divide all frames into two groups, anomalous or normal. Full details are in the paper "CHAD: Charlotte Anomaly Dataset". Please refer to the dataset page for more information.

3 PAPERS • NO BENCHMARKS YET

CholecT40 (Cholecystectomy Action Triplet)

CholecT40 is the first endoscopic dataset introduced to enable research on fine-grained action recognition in laparoscopic surgery.

3 PAPERS • NO BENCHMARKS YET

CoP3D

CoP3D is a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is a large-scale dataset for benchmarking non-rigid 3D reconstruction "in the wild".

3 PAPERS • NO BENCHMARKS YET

Composable activities dataset

The Composable activities dataset consists of 693 videos that contain activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence are captured using a Microsoft Kinect sensor, together with the estimated positions of relevant body joints.

3 PAPERS • NO BENCHMARKS YET

Content4All

Content4All is a collection of six open research datasets aimed at automatic sign language translation research.

3 PAPERS • NO BENCHMARKS YET

Countix-AV

Countix-AV is a dataset for repetitive action counting by sight and sound created by repurposing the Countix dataset.

3 PAPERS • NO BENCHMARKS YET

DCASE 2014

DCASE2014 is an audio classification benchmark.

3 PAPERS • NO BENCHMARKS YET

Deep Fakes Dataset (inamibora)

The Deep Fakes Dataset is a collection of "in the wild" portrait videos for deepfake detection. The videos in the dataset are diverse real-world samples in terms of the source generative model, resolution, compression, illumination, aspect-ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations; totalling up to 142 videos, 32 minutes, and 17 GBs. Synthetic videos are matched with their original counterparts when possible.

3 PAPERS • NO BENCHMARKS YET

DukeMTMC-attribute

The images in the DukeMTMC-attribute dataset come from Duke University. There are 1,812 identities and 34,183 annotated bounding boxes in the DukeMTMC-attribute dataset. The dataset contains 702 identities for training and 1,110 identities for testing, corresponding to 16,522 and 17,661 images respectively. Attributes are annotated at the identity level, and every image in the dataset is annotated with 23 attributes.

3 PAPERS • NO BENCHMARKS YET

Dynamic Replica

Dynamic Replica is a synthetic dataset of stereo videos featuring humans and animals in virtual environments. It is a benchmark for dynamic disparity/depth estimation and 3D reconstruction consisting of 145,200 stereo frames (524 videos).

3 PAPERS • NO BENCHMARKS YET

EPIC-Hotspot

From Grounded Human-Object Interaction Hotspots from Video (ICCV'19): We collect annotations for interaction keypoints on EPIC Kitchens in order to quantitatively evaluate our method in parallel to the OPRA dataset (where annotations are available). We note that these annotations are collected purely for evaluation, and are not used for training our model. We select the 20 most frequent verbs, and select 31 nouns that afford these interactions.

3 PAPERS • 1 BENCHMARK

EgoHOS (Fine-Grained Egocentric Hand-Object Segmentation Dataset)

EgoHOS is a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with during a diverse array of daily activities. The data are collected from multiple sources: 7,458 frames from Ego4D, 2,212 frames from EPIC-KITCHEN, 806 frames from THU-READ, and 350 frames of our own collected egocentric videos with people playing Escape Room. This dataset is designed for tasks including hand state classification, video activity recognition, 3D mesh reconstruction of hand-object interactions, and video inpainting of hand-object foregrounds in egocentric videos.

3 PAPERS • NO BENCHMARKS YET

EgoPAT3D

3 PAPERS • NO BENCHMARKS YET

EgoProceL

EgoProceL is a large-scale dataset for procedure learning. It consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks for procedure learning. EgoProceL contains videos and key-step annotations for multiple tasks from CMU-MMAC, EGTEA Gaze+, and individual tasks like toy-bike assembly, tent assembly, PC assembly, and PC disassembly. EgoProceL overcomes the limitations of third-person videos, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.

3 PAPERS • NO BENCHMARKS YET

FR-FS (Fall Recognition in Figure Skating)

The FR-FS dataset contains 417 videos collected from the FIV dataset and the PyeongChang 2018 Winter Olympic Games. FR-FS contains the critical movements of the athlete's take-off, rotation, and landing. Among them, 276 are smooth landing videos, and 141 are fall videos. To test the generalization performance of our proposed model, we randomly split the fall and landing videos 50/50 into a training set and a testing set.

3 PAPERS • NO BENCHMARKS YET
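
The 50% random split described for FR-FS could be sketched as below. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the split is drawn separately within each class (fall and landing), uses placeholder video identifiers, and introduces a hypothetical helper `stratified_half_split` with an arbitrary seed.

```python
import random


def stratified_half_split(items_by_class, seed=0):
    """Split each class roughly in half at random (hypothetical helper).

    Assumption: the 50% split is applied per class; the source does not
    state this explicitly.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train += [(video, label) for video in shuffled[:half]]
        test += [(video, label) for video in shuffled[half:]]
    return train, test


# Placeholder identifiers; only the class counts (276 and 141) come from
# the description above.
videos = {
    "landing": [f"landing_{i:03d}" for i in range(276)],
    "fall": [f"fall_{i:03d}" for i in range(141)],
}

train, test = stratified_half_split(videos)
print(len(train), len(test))
```

Because 141 is odd, the two halves cannot be exactly equal; this sketch places the extra fall video in the test set.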

Ford Campus Vision and Lidar Data Set

Ford Campus Vision and Lidar Data Set is a dataset collected by an autonomous ground vehicle testbed, based upon a modified Ford F-250 pickup truck. The vehicle is outfitted with a professional (Applanix POS LV) and consumer (Xsens MTI-G) Inertial Measuring Unit (IMU), a Velodyne 3D-lidar scanner, two push-broom forward looking Riegl lidars, and a Point Grey Ladybug3 omnidirectional camera system.

3 PAPERS • NO BENCHMARKS YET

ForgeryNet

We construct the ForgeryNet dataset, an extremely large face forgery dataset with unified annotations in image- and video-level data across four tasks: 1) Image Forgery Classification, including two-way (real / fake), three-way (real / fake with identity-replaced forgery approaches / fake with identity-remained forgery approaches), and n-way (real and 15 respective forgery approaches) classification; 2) Spatial Forgery Localization, which segments the manipulated area of fake images compared to their corresponding source real images; 3) Video Forgery Classification, which re-defines the video-level forgery classification with manipulated frames in random positions (this task is important because attackers in the real world are free to manipulate any target frame); and 4) Temporal Forgery Localization, which localizes the temporal segments that are manipulated. ForgeryNet is by far the largest publicly available deep face forgery dataset in terms of data-scale (2.9 million images, 221,247 video

3 PAPERS • 2 BENCHMARKS

Goal

Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.

3 PAPERS • NO BENCHMARKS YET

HANDAL (HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions)

We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline s

3 PAPERS • NO BENCHMARKS YET

HDM05

HDM05 is a MoCap (motion capture) dataset. It contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as in the ASF/AMC data format. HDM05 contains almost 2337 sequences with 130 motion classes performed by 5 different actors.

3 PAPERS • 1 BENCHMARK

IfAct (Identifying Human Actions Visible in Online Vlogs)

We consider the task of identifying human actions visible in online videos. We focus on the widely spread genre of lifestyle vlogs, which consist of videos of people performing actions while verbally describing them. Our goal is to identify if actions mentioned in the speech description of a video are visually present.

3 PAPERS • NO BENCHMARKS YET

KITTI-Masks

This dataset consists of 2,120 sequences of binary masks of pedestrians. Sequence lengths vary between 2 and 710. For details, we refer to our paper. It is based on the original KITTI Segmentation challenge, which can be found at https://www.vision.rwth-aachen.de/page/mots

3 PAPERS • 1 BENCHMARK

M-VAD Names (M-VAD Names Dataset)

The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the associations with characters' textual mentions, when available. The detection and annotation of the visual appearances of characters in each video clip of each movie was achieved through a semi-automatic approach. The released dataset contains more than 24k annotated video clips, including 63k visual tracks and 34k textual mentions, all associated with their character identities.

3 PAPERS • 1 BENCHMARK

M5Product

The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

3 PAPERS • NO BENCHMARKS YET

MERL Shopping

MERL Shopping is a dataset for training and testing action detection algorithms. The MERL Shopping Dataset consists of 106 videos, each of which is a sequence about 2 minutes long. The videos are from a fixed overhead camera looking down at people shopping in a grocery store setting. Each video contains several instances of the following 5 actions: "Reach To Shelf" (reach hand into shelf), "Retract From Shelf" (retract hand from shelf), "Hand In Shelf" (extended period with hand in the shelf), "Inspect Product" (inspect product while holding it in hand), and "Inspect Shelf" (look at shelf while not touching or reaching for the shelf).

3 PAPERS • NO BENCHMARKS YET

MEVID (Multi-view Extended Videos with Identities Dataset)

The Multi-view Extended Videos with Identities dataset (MEVID) is a dataset for large-scale video person re-identification (ReID) in the wild. It spans an extensive indoor and outdoor environment across nine unique dates in a 73-day window, various camera viewpoints, and entity clothing changes. Specifically, it contains identity labels for 158 unique people wearing 598 outfits, taken from 8,092 tracklets with an average length of about 590 frames, seen in 33 camera views from the very large-scale MEVA person activities dataset.

3 PAPERS • NO BENCHMARKS YET

MISP2021 (Multimodal Information Based Speech Processing 2021)

The MISP2021 challenge dataset is a collection of audio-visual conversational data recorded in a home TV scenario using distant multi-microphones. The dataset captures interactions between several individuals who are engaged in conversations in Chinese while watching TV and interacting with a smart speaker/TV in a living room. The dataset is extensive, comprising 141 hours of audio and video data, which were collected using far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. Notably, this corpus is the first of its kind to offer a distant multimicrophone conversational Chinese audio-visual dataset. Furthermore, it is also the first large vocabulary continuous Chinese lip-reading dataset specifically designed for the adverse home-TV scenario.

3 PAPERS • NO BENCHMARKS YET

MMDB (Multimodal Dyadic Behavior)

Multimodal Dyadic Behavior (MMDB) dataset is a unique collection of multimodal (video, audio, and physiological) recordings of the social and communicative behavior of toddlers. The MMDB contains 160 sessions of 3-5 minute semi-structured play interaction between a trained adult examiner and a child between the age of 15 and 30 months. The MMDB dataset supports a novel problem domain for activity recognition, which consists of the decoding of dyadic social interactions between adults and children in a developmental context.

3 PAPERS • NO BENCHMARKS YET

MSR-VTT Adverbs

MSR-VTT Adverbs is a subset from MSR-VTT with extracted verb-adverb annotations. MSR-VTT Adverbs contains 18 adverbs appearing across 106 actions, forming 464 unique action-adverb pairs in 1,824 video clips.

3 PAPERS • 2 BENCHMARKS

MSU Video Alignment and Retrieval Benchmark Suite

Frame-to-frame video alignment/synchronization

3 PAPERS • 1 BENCHMARK

MeLa BitChute

MeLa BitChute is a near-complete dataset of over 3M videos from 61K channels over 2.5 years (June 2019 to December 2021) from the social video hosting platform BitChute, a commonly used alternative to YouTube. Additionally, the dataset includes a variety of video-level metadata, including comments, channel descriptions, and views for each video.

3 PAPERS • NO BENCHMARKS YET

MovieShots

MovieShots is a dataset to facilitate the shot type analysis in videos. It is a large-scale shot type annotation set that contains 46K shots from 7,858 movies covering a wide variety of movie genres to ensure the inclusion of all scale and movement types of shot. Each shot has two attributes, shot scale and shot movement.

3 PAPERS • NO BENCHMARKS YET

Moviescope

Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset consisting of 5,043 movie records. It is augmented by crawling video trailers associated with each movie from YouTube and text plots from Wikipedia.

3 PAPERS • NO BENCHMARKS YET

Navigation Turing Test

Replay data from human players and AI agents navigating in a 3D game environment.

3 PAPERS • NO BENCHMARKS YET

OAK (Objects Around Krishna)

OAK is an egocentric video dataset for benchmarking online continual object detection. OAK adopts the KrishnaCam videos, an egocentric video stream collected over nine months by a graduate student. OAK provides exhaustive bounding box annotations of 80 video snippets (~17.5 hours) for 105 object categories in outdoor scenes.

3 PAPERS • NO BENCHMARKS YET

OREBA (Objectively Recognizing Eating Behavior and Associated Intake)

The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands.

3 PAPERS • NO BENCHMARKS YET

OpenLane-V2 test

OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attribute of traffic elements and topology relationships on detected objects.

3 PAPERS • 1 BENCHMARK

OpenTTGames

OSAI introduces OpenTTGames, an open dataset aimed at the evaluation of different computer vision tasks in table tennis: ball detection; semantic segmentation of humans, table, and scoreboard; and fast in-game event spotting.

3 PAPERS • NO BENCHMARKS YET
