SAMRS is a remote sensing segmentation dataset which provides object category, location, and instance information that can be used for semantic segmentation, instance segmentation, and object detection, either individually or in combination.
4 PAPERS • NO BENCHMARKS YET
SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data comprises 382,534 building footprints covering an area of 2,544 sq. km of 3-band and 8-band WorldView-2 imagery (0.5 m pixel resolution) across the city of Rio de Janeiro, Brazil. The images are processed as 200 m × 200 m tiles with associated building footprint vectors for training.
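As a sanity check on the numbers above, the tile geometry follows directly from the stated resolution (illustrative arithmetic only; the variable names are ours, not from the dataset):

```python
# Each tile covers 200 m x 200 m at 0.5 m per pixel.
tile_size_m = 200
gsd_m = 0.5  # ground sample distance (metres per pixel)
tile_px = tile_size_m / gsd_m
print(tile_px)  # 400.0 -> each tile is 400 x 400 pixels

# Approximate number of non-overlapping tiles covering the full extent.
area_km2 = 2544
tile_area_km2 = (tile_size_m / 1000) ** 2  # ~0.04 km^2 per tile
print(round(area_km2 / tile_area_km2))  # 63600
```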
4 PAPERS • 2 BENCHMARKS
Swiss3DCities is a dataset that is manually annotated for semantic segmentation with per-point labels, and is built using photogrammetry from images acquired by multirotors equipped with high-resolution cameras.
A large collection of interaction-rich video data which are annotated and analyzed.
Synthetic training dataset of 50,000 depth images and 320,000 object masks generated from simulated heaps of 3D CAD models.
4 PAPERS • 1 BENCHMARK
AeroRIT is a hyperspectral dataset to facilitate aerial hyperspectral scene understanding.
3 PAPERS • NO BENCHMARKS YET
The Aircraft Context Dataset, a composition of two inter-compatible large-scale and versatile image datasets focusing on manned aircraft and UAVs, is intended for training and evaluating classification, detection and segmentation models in aerial domains. Additionally, a set of relevant meta-parameters can be used to quantify dataset variability as well as the impact of environmental conditions on model performance.
We design an all-day semantic segmentation benchmark, all-day CityScapes: the first semantic segmentation benchmark to contain samples from all-day scenarios, i.e., from dawn to night. The dataset will be made publicly available at https://isis-data.science.uva.nl/cv/1ADcityscape.zip.
3 PAPERS • 1 BENCHMARK
DOORS is a dataset designed for boulder recognition, centroid regression, segmentation, and navigation applications. The dataset is divided into two sets:
DeepSportradar is a benchmark suite of computer vision tasks, datasets and benchmarks for automated sport understanding. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and the proposed baseline method are provided.
The images in the DukeMTMC-attribute dataset come from Duke University. There are 1,812 identities and 34,183 annotated bounding boxes in the dataset, with 702 identities for training and 1,110 identities for testing, corresponding to 16,522 and 17,661 images respectively. The attributes are annotated at the identity level; every image in the dataset is annotated with 23 attributes.
EDEN (Enclosed garDEN) is a multimodal synthetic dataset, a dataset for nature-oriented applications. The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow.
EgoHOS is a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and the objects being interacted with during a diverse array of daily activities. The data are collected from multiple sources: 7,458 frames from Ego4D, 2,212 frames from EPIC-KITCHENS, 806 frames from THU-READ, and 350 frames of our own egocentric videos of people playing Escape Room. This dataset is designed for tasks including hand state classification, video activity recognition, 3D mesh reconstruction of hand-object interactions, and video inpainting of hand-object foregrounds in egocentric videos.
The Five-Billion-Pixels dataset contains more than 5 billion labeled pixels across 150 high-resolution Gaofen-2 (4 m) satellite images, annotated in a 24-category system covering artificially constructed, agricultural, and natural classes. It offers rich categories, large coverage, wide distribution, and high spatial resolution, which reflects the distribution of real-world ground objects well and can benefit a range of land-cover studies.
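The headline pixel count can be cross-checked against the stated resolution (a rough back-of-the-envelope sketch of ours, not figures from the dataset itself):

```python
labeled_pixels = 5e9
gsd_m = 4  # stated Gaofen-2 resolution, metres per pixel
pixel_area_m2 = gsd_m ** 2  # 16 m^2 of ground per labeled pixel

# Implied labeled ground coverage, in km^2.
covered_km2 = labeled_pixels * pixel_area_m2 / 1e6
print(covered_km2)  # 80000.0

# Average labeled pixels per image across the 150 scenes (~3.3e7).
print(labeled_pixels / 150)
```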
Large-scale and open-access LiDAR dataset intended for the evaluation of real-time semantic segmentation algorithms. In contrast to other large-scale datasets, HelixNet includes fine-grained data about the sensor's rotation and position, as well as the points' release time.
Consists of user-generated aerial videos from social media with annotations of instance-level building damage masks. This provides the first benchmark for quantitative evaluation of models to assess building damage using aerial videos.
The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it? "), and symbolic references ads make (e.g. a dove symbolizes peace).
A video capsule endoscopy dataset for polyp segmentation.
The MUAD dataset (Multiple Uncertainties for Autonomous Driving) consists of 10,413 realistic synthetic images with diverse adverse weather conditions (night, fog, rain, snow), out-of-distribution objects, and annotations for semantic segmentation, depth estimation, and object and instance detection. Predictive uncertainty estimation is essential for the safe deployment of deep neural networks in real-world autonomous systems, and MUAD allows a better assessment of the impact of different sources of uncertainty on model performance.
The Market1501-Attributes dataset is built from the Market1501 dataset, augmenting it with 28 hand-annotated attributes such as gender, age, sleeve length, flags for items carried, and upper- and lower-clothing colors.
The “Medico automatic polyp segmentation challenge” aims to develop computer-aided diagnosis systems for automatic polyp segmentation that detect all types of polyps (for example, irregular, small, or flat polyps) with high efficiency and accuracy. The main goal of the challenge is to benchmark semantic segmentation algorithms on a publicly available dataset, emphasizing robustness, speed, and generalization.
We present a large-scale dataset for 3D urban scene understanding. Compared to existing datasets, our dataset consists of 75 outdoor urban scenes with diverse backgrounds, encompassing over 15,000 images. These scenes offer 360° hemispherical views, capturing diverse foreground objects illuminated under various lighting conditions. Additionally, our dataset encompasses scenes that are not limited to forward-driving views, addressing the limitations of previous datasets such as limited overlap and coverage between camera views. The closest pre-existing dataset for generalizable evaluation is DTU [2] (80 scenes) which comprises mostly indoor objects and does not provide multiple foreground objects or background scenes.
OSAI introduces OpenTTGames, an open dataset aimed at evaluating different computer vision tasks in table tennis: ball detection; semantic segmentation of humans, table, and scoreboard; and fast in-game event spotting.
The PETRAW dataset is composed of 150 sequences of peg-transfer training sessions. The objective of a peg-transfer session is to transfer 6 blocks from the left side of the board to the right and back. Each block must be extracted from a peg with one hand, transferred to the other hand, and inserted in a peg on the other side of the board. All cases were acquired by a non-medical expert at the LTSI laboratory of the University of Rennes. The dataset was divided into a training set of 90 cases and a test set of 60 cases. Each case comprises kinematic data, a video, a semantic segmentation of each frame, and a workflow annotation.
3 PAPERS • 6 BENCHMARKS
RobotPush is a dataset for object singulation – the task of separating cluttered objects through physical interaction. The dataset contains 3456 training images with labels and 1024 validation images with labels. It consists of simulated and real-world data collected from a PR2 robot equipped with a Kinect 2 camera. The dataset also contains ground truth instance segmentation masks for 110 images in the test set.
The SWINySEG dataset contains 6,768 daytime and nighttime images of sky/cloud patches along with their corresponding binary ground-truth maps. The images in SWINySEG are taken from two of our earlier sky/cloud image segmentation datasets, SWIMSEG and SWINSEG. All images were captured in Singapore using WAHRSIS, a calibrated ground-based whole-sky imager, over 12 months from January to December 2016. The ground-truth annotation was done in consultation with experts from the Singapore Meteorological Services.
An open source Multi-View Overhead Imagery dataset with 27 unique looks from a broad range of viewing angles (-32.5 degrees to 54.0 degrees). Each of these images covers the same 665 square km geographic extent and is annotated with 126,747 building footprint labels, enabling direct assessment of the impact of viewpoint perturbation on model performance.
UNDD consists of 7,125 unlabelled day and night images; additionally, it has 75 night images with pixel-level annotations whose classes are equivalent to those of the Cityscapes dataset.
The Vocal Folds dataset is a dataset for automatic segmentation of laryngeal endoscopic images. It consists of 8 sequences from 2 patients, containing 536 hand-segmented in vivo colour images of the larynx captured during two different resection interventions, at a resolution of 512×512 pixels.
Created from endoscopic video feeds of real-world surgical procedures. Overall, the data consists of 307 images, each of which is annotated for the organs and different surgical instruments present in the scene.
ATLANTIS is a benchmark for semantic segmentation of waterbody images. The dataset covers a wide range of natural waterbodies, such as seas, lakes, and rivers, as well as man-made (artificial) water-related structures such as dams, reservoirs, canals, and piers. ATLANTIS includes 5,195 pixel-wise annotated images split into 3,364 training, 535 validation, and 1,296 testing images. In addition to the 35 waterbody labels, the dataset covers 21 general labels such as person, car, road, and building.
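The split sizes above are internally consistent; a one-line check (illustrative only):

```python
# ATLANTIS split sizes as stated in the description.
train, val, test = 3364, 535, 1296
assert train + val + test == 5195  # matches the stated total
print(f"test fraction: {test / (train + val + test):.1%}")  # ~24.9%
```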
2 PAPERS • 1 BENCHMARK
CalCROP21 is a georeferenced multi-spectral dataset of satellite imagery and crop labels. It is a semantic segmentation benchmark for the diverse crops of the Central Valley region of California at 10 m spatial resolution, built using a robust Google Earth Engine-based image processing pipeline.
2 PAPERS • NO BENCHMARKS YET
The dataset contains two subsets of synthetic, semantically segmented road-scene images, which have been created for developing and applying the methodology described in the paper "A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View" (IEEE Xplore, arXiv, YouTube)
2 PAPERS • 2 BENCHMARKS
The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken from the validation split of the Cityscapes dataset.
Colorectal Adenoma contains 177 whole slide images (156 contain adenoma) gathered and labelled by pathologists from the Department of Pathology, The Chinese PLA General Hospital.
A benchmark dataset for training and evaluating global cloud classification models. It consists of one year of 1km resolution MODIS hyperspectral imagery merged with pixel-width 'tracks' of CloudSat cloud labels.
A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying images from the GI tract into 23 distinct classes. The second task focuses on efficient classification, measured by the amount of time spent processing each image. The last task relates to automatically segmenting polyps.
The Extended Agriculture-Vision dataset comprises two parts:
This dataset is made up of forward-looking sonar images containing ten classes of underwater debris. The dataset can be used for segmentation or object detection. Applications include training computer vision models for underwater robotics applications.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South Africa and the Low-Frequency Array (LOFAR) in the Netherlands. These datasets are intended to test radio-frequency interference (RFI) detection schemes. This entry pertains to the HERA dataset specifically.
Hephaestus is the first large-scale InSAR dataset. Driven by volcanic unrest detection, it provides 19,919 unique satellite frames annotated with a diverse set of labels. Moreover, each sample is accompanied by a textual description of its contents. The goal of this dataset is to boost research on the exploitation of interferometric data, enabling the application of state-of-the-art computer vision and NLP methods. Furthermore, the annotated dataset is bundled with a large archive of unlabeled frames to enable large-scale self-supervised learning. The final size of the dataset amounts to 110,573 interferograms.
HuTics contains 2,040 images showing how humans use deictic gestures to interact with various daily-life objects. The images are annotated with segmentation masks of the object(s) of interest. The data were originally collected for gesture-aware, object-agnostic segmentation tasks.
This dataset contains simulated and expert-labelled spectrograms from two radio telescopes: the Hydrogen Epoch of Reionization Array (HERA) in South Africa and the Low-Frequency Array (LOFAR) in the Netherlands. These datasets are intended to test radio-frequency interference (RFI) detection schemes. This entry pertains to the LOFAR dataset specifically.
LaRS is the largest and most diverse panoptic maritime obstacle detection dataset.
2 PAPERS • 3 BENCHMARKS
MVTec D2S is a benchmark for instance-aware semantic segmentation in an industrial domain. It contains 21,000 high-resolution images with pixel-wise labels of all object instances. The objects comprise groceries and everyday products from 60 categories. The benchmark is designed such that it resembles the real-world setting of an automatic checkout, inventory, or warehouse system. The training images only contain objects of a single class on a homogeneous background, while the validation and test sets are much more complex and diverse.
Meta Omnium is a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner.
The Mila Simulated Floods dataset is a 1.5 square km virtual world built with the Unity3D game engine, including urban, suburban, and rural areas.