The Sku110k dataset provides 11,762 images with more than 1.7 million annotated bounding boxes captured in densely packed scenarios, including 8,233 images for training, 588 images for validation, and 2,941 images for testing. There are around 1,733,678 instances in total. The images are collected from thousands of supermarket stores and are of various scales, viewing angles, lighting conditions, and noise levels. All the images are resized into a resolution of one megapixel. Most of the instances in the dataset are tightly packed and typically of a certain orientation in the rage of [−15∘, 15∘].
20 PAPERS • 1 BENCHMARK
ModaNet is a street fashion images dataset consisting of annotations related to RGB images. ModaNet provides multiple polygon annotations for each image. Each polygon is associated with a label from 13 meta fashion categories. The annotations are based on images in the PaperDoll image set, which has only a few hundred images annotated by the superpixel-based tool.
19 PAPERS • 1 BENCHMARK
RADIATE (RAdar Dataset In Adverse weaThEr) is new automotive dataset created by Heriot-Watt University which includes Radar, Lidar, Stereo Camera and GPS/IMU. The data is collected in different weather scenarios (sunny, overcast, night, fog, rain and snow) to help the research community to develop new methods of vehicle perception. The radar images are annotated in 7 different scenarios: Sunny (Parked), Sunny/Overcast (Urban), Overcast (Motorway), Night (Motorway), Rain (Suburban), Fog (Suburban) and Snow (Suburban). The dataset contains 8 different types of objects (car, van, truck, bus, motorbike, bicycle, pedestrian and group of pedestrians).
19 PAPERS • 2 BENCHMARKS
TinyPerson is a benchmark for tiny object detection in a long distance and with massive backgrounds. The images in TinyPerson are collected from the Internet. First, videos with a high resolution are collected from different websites. Second, images from the video are sampled every 50 frames. Then images with a certain repetition (homogeneity) are deleted, and the resulting images are annotated with 72,651 objects with bounding boxes by hand.
19 PAPERS • NO BENCHMARKS YET
Argoverse-HD is a dataset built for streaming object detection, which encompasses real-time object detection, video object detection, tracking, and short-term forecasting. It contains the video data from Argoverse 1.1 with our own MS COCO-style bounding box annotations with track IDs. The annotations are backward-compatible with COCO as one can directly evaluate COCO pre-trained models on this dataset to estimate the efficiency or the cross-dataset generalization capability of the models. The dataset contains high-quality and temporally-dense annotations for high-resolution videos (1920 x 1200 @ 30 FPS). Overall, there are 70,000 image frames and 1.3 million bounding boxes.
17 PAPERS • 4 BENCHMARKS
OpenImages V6 is a large-scale dataset , consists of 9 million training images, 41,620 validation samples, and 125,456 test samples. It is a partially annotated dataset, with 9,600 trainable classes
17 PAPERS • 3 BENCHMARKS
SceneNet is a dataset of labelled synthetic indoor scenes. There are several labeled indoor scenes, including:
17 PAPERS • NO BENCHMARKS YET
The Extended Optical Remote Sensing Saliency Detection (EORSSD) dataset is an extension of the ORSSD dataset. This new dataset is larger and more varied than the original. It contains 2,000 images and corresponding pixel-wise ground truth, which includes many semantically meaningful but challenging images.
16 PAPERS • NO BENCHMARKS YET
SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.
16 PAPERS • 3 BENCHMARKS
The MALF dataset is a large dataset with 5,250 images annotated with multiple facial attributes and it is specifically constructed for fine grained evaluation.
15 PAPERS • NO BENCHMARKS YET
RPC is a large-scale retail product checkout dataset and collects 200 retail SKUs. The collected SKUs can be divided into 17 meta categories, i.e., puffed food, dried fruit, dried food, instant drink, instant noodles, dessert, drink, alcohol, milk, canned food, chocolate, gum, candy, seasoner, personal hygiene, tissue, stationery.
Washington RGB-D is a widely used testbed in the robotic community, consisting of 41,877 RGB-D images organized into 300 instances divided in 51 classes of common indoor objects (e.g. scissors, cereal box, keyboard etc). Each object instance was positioned on a turntable and captured from three different viewpoints while rotating.
MinneApple is a benchmark dataset for apple detection and segmentation. The fruits are labelled using polygonal masks for each object instance to aid in precise object detection, localization, and segmentation. Additionally, the dataset also contains data for patch-based counting of clustered fruits. The dataset contains over 41, 000 annotated object instances in 1000 images.
14 PAPERS • NO BENCHMARKS YET
Unconstrained Face Detection Dataset (UFDD) aims to fuel further research in unconstrained face detection.
13 PAPERS • NO BENCHMARKS YET
People-Art is an object detection dataset which consists of people in 43 different styles. People contained in this dataset are quite different from those in common photographs. There are 42 categories of art styles and movements including Naturalism, Cubism, Socialist Realism, Impressionism, and Suprematism
11 PAPERS • 2 BENCHMARKS
SODA10M is a large-scale object detection benchmark for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data. SODA10M contains 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected every ten seconds per frame within 32 different cities under different weather conditions, periods and location scenes.
11 PAPERS • NO BENCHMARKS YET
TJU-DHD is a high-resolution dataset for object detection and pedestrian detection. The dataset contains 115,354 high-resolution images (52% images have a resolution of 1624×1200 pixels and 48% images have a resolution of at least 2,560×1,440 pixels) and 709,330 labelled objects in total with a large variance in scale and appearance.
AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others.
10 PAPERS • 1 BENCHMARK
DeepScores contains high quality images of musical scores, partitioned into 300,000 sheets of written music that contain symbols of different shapes and sizes. For advancing the state-of-the-art in small objects recognition, and by placing the question of object recognition in the context of scene understanding.
10 PAPERS • NO BENCHMARKS YET
HyperKvasir dataset contains 110,079 images and 374 videos where it captures anatomical landmarks and pathological and normal findings. A total of around 1 million images and video frames altogether.
10 PAPERS • 2 BENCHMARKS
WiderPerson contains a total of 13,382 images with 399,786 annotations, i.e., 29.87 annotations per image, which means this dataset contains dense pedestrians with various kinds of occlusions. Hence, pedestrians in the proposed dataset are extremely challenging due to large variations in the scenario and occlusion, which is suitable to evaluate pedestrian detectors in the wild.
9 PAPERS • 1 BENCHMARK
Description Detection Dataset ($D^3$, /dikju:b/) is an attempt at creating a next-generation object detection dataset. Unlike traditional detection datasets, the class names of the objects are no longer simple nouns or noun phrases, but rather complex and descriptive, such as a dog not being held by a leash. For each image in the dataset, any object that matches the description is annotated. The dataset provides annotations such as bounding boxes and finely crafted instance masks.It comprises of 422 well-designed descriptions and 24,282 positive object-description pairs.
8 PAPERS • 1 BENCHMARK
The RIT-18 dataset was built for the semantic segmentation of remote sensing imagery. It was collected with the Tetracam Micro-MCA6 multispectral imaging sensor flown on-board a DJI-1000 octocopter.
8 PAPERS • NO BENCHMARKS YET
ReDWeb-S is a large-scale challenging dataset for Salient Object Detection. It has totally 3179 images with various real-world scenes and high-quality depth maps. The dataset is split into a training set with 2179 RGB-D image pairs and a testing set with the remaining 1000 image pairs.
SKU110K-R is a dataset relabeled with oriented bounding boxes based on SKU110K. It is focused on evaluating oriented and densely packed object detection.
The TrashCan dataset is an instance-segmentation dataset of underwater trash. It is comprised of annotated images (7,212 images) which contain observations of trash, ROVs, and a wide variety of undersea flora and fauna. The annotations in this dataset take the format of instance segmentation annotations: bitmaps containing a mask marking which pixels in the image contain each object. The imagery in TrashCan is sourced from the J-EDI (JAMSTEC E-Library of Deep-sea Images) dataset, curated by the Japan Agency of Marine Earth Science and Technology (JAMSTEC).
A Multi-Task 4D Radar-Camera Fusion Dataset for Autonomous Driving on Water Surfaces description of the dataset
8 PAPERS • 2 BENCHMARKS
Bamboo Dataset is a mega-scale and information-dense dataset for both classification and detection pre-training. It is built upon integrating 24 public datasets (e.g. ImagenNet, Places365, Object365, OpenImages) and added new annotations through active learning. Bamboo has 69M image classification annotations and 32M object bounding boxes.
7 PAPERS • NO BENCHMARKS YET
BigDetection is a new large-scale benchmark to build more general and powerful object detection systems. It leverages the training data from existing datasets (LVIS, OpenImages and Object365) with carefully designed principles, and curate a larger dataset for improved detector pre-training. BigDetection dataset has 600 object categories and contains 3.4M training images with 36M object bounding boxes.
7 PAPERS • 1 BENCHMARK
Cops-Ref is a dataset for visual reasoning in context of referring expression comprehension with two main features.
This dataset is a collection of spoken language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset consists of situations where the robot is instructed by the operator to pick up a specific object and move it to another location: for example, Move the blue and white tissue box to the top right bin. This dataset consists of RGBD images, bounding box annotations, destination box annotations, and text instructions.
SpaceNet 2: Building Detection v2 - is a dataset for building footprint detection in geographically diverse settings from very high resolution satellite images. It contains over 302,701 building footprints, 3/8-band Worldview-3 satellite imagery at 0.3m pixel res., across 5 cities (Rio de Janeiro, Las Vegas, Paris, Shanghai, Khartoum), and covers areas that are both urban and suburban in nature. The dataset was split using 60%/20%/20% for train/test/validation.
TTPLA is a public dataset which is a collection of aerial images on Transmission Towers (TTs) and Power Lines (PLs). It can be used for detection and segmentation of transmission towers and power lines. It consists of 1,100 images with the resolution of 3,840×2,160 pixels, as well as manually labelled 8,987 instances of TTs and PLs.
APRICOT is a collection of over 1,000 annotated photographs of printed adversarial patches in public locations. The patches target several object categories for three COCO-trained detection models, and the photos represent natural variation in position, distance, lighting conditions, and viewing angle.
6 PAPERS • NO BENCHMARKS YET
BAAI-VANJEE is a dataset for benchmarking and training various computer vision tasks such as 2D/3D object detection and multi-sensor fusion. The BAAI-VANJEE roadside dataset consists of LiDAR data and RGB images collected by VANJEE smart base station placed on the roadside about 4.5m high. This dataset contains 2500 frames of LiDAR data, 5000 frames of RGB images, including 20% collected at the same time. It also contains 12 classes of objects, 74K 3D object annotations and 105K 2D object annotations.
The EuroCity Persons dataset provides a large number of highly diverse, accurate and detailed annotations of pedestrians, cyclists and other riders in urban traffic scenes. The images for this dataset were collected on-board a moving vehicle in 31 cities of 12 European countries. With over 238,200 person instances manually labeled in over 47,300 images, EuroCity Persons is nearly one order of magnitude larger than person datasets used previously for benchmarking. The dataset furthermore contains a large number of person orientation annotations (over 211,200).
Falling Things (FAT) is a dataset for advancing the state-of-the-art in object detection and 3D pose estimation in the context of robotics. It consists of generated photorealistic images with accurate 3D pose annotations for all objects in 60k images.
Freiburg Groceries is a groceries classification dataset consisting of 5000 images of size 256x256, divided into 25 categories. It has imbalanced class sizes ranging from 97 to 370 images per class. Images were taken in various aspect ratios and padded to squares.
IIIT-AR-13K is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. This dataset contains a total of 13k annotated page images with objects in five different popular categories - table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection.
Lytro Illum is a new light field dataset using a Lytro Illum camera. 640 light fields are collected with significant variations in terms of size, textureness, background clutter and illumination, etc. Micro-lens image arrays and central viewing images are generated, and corresponding ground-truth maps are produced.
MobilityAids is a dataset for perception of people and their mobility aids. The annotated dataset contains five classes: pedestrian, person in wheelchair, pedestrian pushing a person in a wheelchair, person using crutches and person using a walking frame. In total the hospital dataset has over 17, 000 annotated RGB-D images, containing people categorized according to the mobility aids they use. The images were collected in the facilities of the Faculty of Engineering of the University of Freiburg and in a hospital in Frankfurt.
The PS-Battles dataset is gathered from a large community of image manipulation enthusiasts and provides a basis for media derivation and manipulation detection in the visual domain. The dataset consists of 102'028 images grouped into 11'142 subsets, each containing the original image as well as a varying number of manipulated derivatives.
The complete blood count (CBC) dataset contains 360 blood smear images along with their annotation files splitting into Training, Testing, and Validation sets. The training folder contains 300 images with annotations. The testing and validation folder both contain 60 images with annotations. We have done some modifications over the original dataset to prepare this CBC dataset where some of the image annotation files contain very low red blood cells (RBCs) than actual and one annotation file does not include any RBC at all although the cell smear image contains RBCs. So, we clear up all the fallacious files and split the dataset into three parts. Among the 360 smear images, 300 blood cell images with annotations are used as the training set first, and then the rest of the 60 images with annotations are used as the testing set. Due to the shortage of data, a subset of the training set is used to prepare the validation set which contains 60 images with annotations.
5 PAPERS • NO BENCHMARKS YET
DUO is a dataset for Underwater object detection for robot picking. The dataset contains a collection of diverse underwater images with more rational annotations.
Breast MRI scans of 922 cancer patients from Duke University, with tumor bounding box annotations, clinical, imaging, and many other features, and more.
HJDataset is a large dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts.
Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud.
5 PAPERS • 1 BENCHMARK
Satlas is a remote sensing dataset and benchmark that is large in both breadth, featuring all of the aforementioned applications and more, as well as scale, comprising 290M labels under 137 categories and 7 label modalities.