The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,449 PAPERS • 93 BENCHMARKS
The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. While other datasets outdoors exist, they are all restricted to a small recording volume. 3DPW is the first one that includes video footage taken from a moving phone camera.
355 PAPERS • 5 BENCHMARKS
JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset of the larger HMDB51 dataset collected from digitized movies and YouTube videos. The dataset contains video and annotation for puppet flow per frame (approximated optimal flow on the person), puppet mask per frame, joint positions per frame, action label per clip and meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).
235 PAPERS • 8 BENCHMARKS
The YCB-Video dataset is a large-scale video dataset for 6D object pose estimation. provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames.
150 PAPERS • 6 BENCHMARKS
MPII Human Pose Dataset is a dataset for human pose estimation. It consists of around 25k images extracted from online videos. Each image contains one or more people, with over 40k people annotated in total. Among the 40k samples, ∼28k samples are for training and the remainder are for testing. Overall the dataset covers 410 human activities and each image is provided with an activity label. Images were extracted from a YouTube video and provided with preceding and following un-annotated frames.
138 PAPERS • 2 BENCHMARKS
SUN3D contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed by 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
116 PAPERS • NO BENCHMARKS YET
The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.
104 PAPERS • 4 BENCHMARKS
The TotalCapture dataset consists of 5 subjects performing several activities such as walking, acting, a range of motion sequence (ROM) and freestyle motions, which are recorded using 8 calibrated, static HD RGB cameras and 13 IMUs attached to head, sternum, waist, upper arms, lower arms, upper legs, lower legs and feet, however the IMU data is not required for our experiments. The dataset has publicly released foreground mattes and RGB images. Ground-truth poses are obtained using a marker-based motion capture system, with the markers are <5mm in size. All data is synchronised and operates at a framerate of 60Hz, providing ground truth poses as joint positions.
52 PAPERS • 2 BENCHMARKS
The FLIC dataset contains 5003 images from popular Hollywood movies. The images were obtained by running a state-of-the-art person detector on every tenth frame of 30 movies. People detected with high confidence (roughly 20K candidates) were then sent to the crowdsourcing marketplace Amazon Mechanical Turk to obtain ground truth labelling. Each image was annotated by five Turkers to label 10 upper body joints. The median-of-five labelling was taken in each image to be robust to outlier annotation. Finally, images were rejected manually by if the person was occluded or severely non-frontal.
45 PAPERS • 2 BENCHMARKS
JTA is a dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios.
32 PAPERS • 1 BENCHMARK
The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people in static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation and testing.
26 PAPERS • NO BENCHMARKS YET
Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.
16 PAPERS • 2 BENCHMARKS
A three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
16 PAPERS • NO BENCHMARKS YET
First-Person Hand Action Benchmark is a collection of RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations.
13 PAPERS • 2 BENCHMARKS
The EgoDexter dataset provides both 2D and 3D pose annotations for 4 testing video sequences with 3190 frames. The videos are recorded with body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1485 out of 3190 frames.
10 PAPERS • NO BENCHMARKS YET
We provide manual annotations of 14 semantic keypoints for 100,000 car instances (sedan, suv, bus, and truck) from 53,000 images captured from 18 moving cameras at Multiple intersections in Pittsburgh, PA. Please fill the google form to get a email with the download links:
8 PAPERS • 2 BENCHMARKS
MonoPerfCap is a benchmark dataset for human 3D performance capture from monocular video input consisting of around 40k frames, which covers a variety of different scenarios.
6 PAPERS • NO BENCHMARKS YET
SLOPER4D is a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. It consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 2,000 (up to 13,000), including more than 100K LiDAR frames, 300k video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE.
6 PAPERS • 1 BENCHMARK
BRACE is a dataset for audio-conditioned dance motion synthesis challenging common assumptions for this task:
5 PAPERS • 2 BENCHMARKS
The SynthHands dataset is a dataset for hand pose estimation which consists of real captured hand motion retargeted to a virtual hand with natural backgrounds and interactions with different objects. The dataset contains data for male and female hands, both with and without interaction with objects. While the hand and foreground object are synthtically generated using Unity, the motion was obtained from real performances as described in the accompanying paper. In addition, real object textures and background images (depth and color) were used. Ground truth 3D positions are provided for 21 keypoints of the hand.
5 PAPERS • NO BENCHMARKS YET
A new dataset with significant occlusions related to object manipulation.
4 PAPERS • NO BENCHMARKS YET
The Composable activities dataset consists of 693 videos that contain activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence is captured using a Microsoft Kinect sensor and estimate position of relevant body joints.
3 PAPERS • NO BENCHMARKS YET
Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 different subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. Contrary to other markerless datasets we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system and achieve a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
2 PAPERS • NO BENCHMARKS YET
The HOPE-Video dataset contains 10 video sequences (2038 frames) with 5-20 objects on a tabletop scene captured by a robot arm-mounted RealSense D415 RGBD camera. In each sequence, the camera is moved to capture multiple views of a set of objects in the robotic workspace. First COLMAP was applied to refine the camera poses (keyframes at 6~fps) provided by forward kinematics and RGB calibration from RealSense to Baxter's wrist camera. 3D dense point cloud was then generated via CascadeStereo (included for each sequence in 'scene.ply'). Ground truth poses for the HOPE objects models in the world coordinate system were annotated manually using the CascadeStereo point clouds. The following are provided for each frame:
HuPR is a human pose estimation benchmark is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. This dataset contains 235 sequences of data in an indoor environment, with each sequence being one-minute long and totalling about 4 hour-long video data.
The data includes all movement trajectories extracted from the videos of Parkinson's assessments using Convolutional Pose Machines (CPM) as well as the confidence values from CPM. The dataset also includes ground truth ratings of parkinsonism and dyskinesia severity using the UDysRS, UPDRS, and CAPSIT.
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume to observe also static scene parts besides deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. To tackle this issue with a common benchmark, we introduce the Drunkard’s Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings lets us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality.
1 PAPER • 1 BENCHMARK
InfiniteRep is a synthetic, open-source dataset for fitness and physical therapy (PT) applications. It includes 1k videos of diverse avatars performing multiple repetitions of common exercises. It includes significant variation in the environment, lighting conditions, avatar demographics, and movement trajectories. From cadence to kinematic trajectory, each rep is done slightly differently -- just like real humans. InfiniteRep videos are accompanied by a rich set of pixel-perfect labels and annotations, including frame-specific repetition counts.
0 PAPER • NO BENCHMARKS YET