This work presents two new benchmark datasets (CIFAR-10N, CIFAR-100N), equipping the training datasets of CIFAR-10 and CIFAR-100 with human-annotated real-world noisy labels collected from Amazon Mechanical Turk.
77 PAPERS • 6 BENCHMARKS
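A minimal sketch of attaching the CIFAR-10N noisy labels to the standard CIFAR-10 training set. The file name and dictionary keys are assumed from the official CIFAR-N release (https://github.com/UCSC-REAL/cifar-10-100n) and should be verified against your copy:

```python
import numpy as np
import torch
from torchvision.datasets import CIFAR10

# Standard CIFAR-10 training images, in their canonical order.
train_set = CIFAR10(root="./data", train=True, download=True)

# Assumed layout of the official CIFAR-10N label file.
noise_file = torch.load("CIFAR-10_human.pt")
noisy = np.asarray(noise_file["aggre_label"])   # other keys: 'worse_label', 'random_label1', ...
clean = np.asarray(noise_file["clean_label"])

# Fraction of training labels flipped by the human annotators.
print("noise rate:", (noisy != clean).mean())
```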
Comprises 11 hand gesture categories from 29 subjects under 3 illumination conditions.
77 PAPERS • 5 BENCHMARKS
ChestX-ray8 is a medical imaging dataset comprising 108,948 frontal-view X-ray images of 32,717 unique patients (collected between 1992 and 2015), labeled with eight common disease labels text-mined from the corresponding radiological reports via NLP techniques.
76 PAPERS • NO BENCHMARKS YET
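Since each image can carry several of the eight findings, a common first step is to expand the pipe-separated label strings into multi-hot vectors. The CSV file and column names below are assumptions about the released metadata; check them against your download:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed metadata file and column; labels look like "Effusion|Infiltration".
df = pd.read_csv("Data_Entry_2017.csv")
labels = df["Finding Labels"].str.split("|")

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)   # (num_images, num_findings) multi-hot matrix
print(mlb.classes_)             # finding names, incl. "No Finding"
```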
ImageNet-O consists of images from classes that are not found in the ImageNet-1k dataset. It is used to test the robustness of vision models to out-of-distribution samples. Performance is reported using the area under the precision-recall curve (AUPR).
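A sketch of the AUPR evaluation using scikit-learn. The negative max-softmax anomaly score is a common convention and an assumption here, since the benchmark only fixes the metric:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def imagenet_o_aupr(is_ood: np.ndarray, softmax_probs: np.ndarray) -> float:
    """is_ood: 1 for ImageNet-O images, 0 for in-distribution images.
    softmax_probs: (N, 1000) classifier outputs for the same images.
    Uses negative max-softmax confidence as the anomaly score."""
    anomaly_score = -softmax_probs.max(axis=1)
    return average_precision_score(is_ood, anomaly_score)
```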
AbstractReasoning is a dataset for abstract reasoning, where the goal is to infer the correct answer panel from the given context panels.
74 PAPERS • NO BENCHMARKS YET
The Oxford-IIIT Pet Dataset has 37 categories with roughly 200 images for each class. The images have large variations in scale, pose, and lighting. All images have an associated ground-truth annotation of breed, head ROI, and pixel-level trimap segmentation.
73 PAPERS • 5 BENCHMARKS
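The dataset is available through torchvision (0.13+), which can return the breed label and the trimap segmentation together; a minimal loading sketch:

```python
from torchvision import datasets

# target_types can request the category label and/or the trimap mask.
pets = datasets.OxfordIIITPet(
    root="./data", split="trainval",
    target_types=("category", "segmentation"),
    download=True,
)
img, (breed, trimap) = pets[0]
print(len(pets.classes))  # 37 breeds
```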
BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which covers i) 120x120 pixels for the 10m bands; ii) 60x60 pixels for the 20m bands; and iii) 20x20 pixels for the 60m bands.
64 PAPERS • 3 BENCHMARKS
The PlantVillage dataset consists of 54,303 healthy and unhealthy leaf images divided into 38 categories by species and disease.
61 PAPERS • 1 BENCHMARK
The Places365 dataset is a scene recognition dataset. It is composed of 10 million images spanning 434 scene classes. There are two versions of the dataset: Places365-Standard, with 1.8 million train and 36,000 validation images from K=365 scene classes, and Places365-Challenge-2016, in which the training set is augmented with 6.2 million extra images, including 69 new scene classes (leading to a total of 8 million train images from 434 scene classes).
55 PAPERS • 8 BENCHMARKS
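Places365-Standard can be loaded through torchvision's built-in dataset class; a minimal sketch (the small=True variant downloads the 256x256 images):

```python
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# split can be 'train-standard', 'train-challenge', or 'val'.
train = datasets.Places365(root="./places365", split="train-standard",
                           small=True, download=True, transform=tfm)
val = datasets.Places365(root="./places365", split="val",
                         small=True, download=True, transform=tfm)
print(len(train), len(val), len(train.classes))  # ~1.8M, 36000, 365
```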
MINC (Materials in Context) is a large-scale, open dataset of materials in the wild.
53 PAPERS • NO BENCHMARKS YET
The PGM dataset serves as a tool for studying both abstract reasoning and generalisation in models. Generalisation is a multi-faceted phenomenon; there is no single, objective way in which models can or should generalise beyond their experience. The PGM dataset provides a means to measure the generalisation ability of models in different ways, each of which may be more or less interesting to researchers depending on their intended training setup and applications.
49 PAPERS • NO BENCHMARKS YET
Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories, which include seven parent categories.
48 PAPERS • 3 BENCHMARKS
The MultiMNIST dataset is generated from MNIST. The training and test sets are generated by overlaying a digit on top of another digit from the same set (training or test) but a different class. Each digit is shifted by up to 4 pixels in each direction, resulting in a 36×36 image. Since a digit in a 28×28 image is bounded by a 20×20 box, the two digits' bounding boxes overlap by 80% on average. For each digit in the MNIST dataset, 1,000 MultiMNIST examples are generated, so the training set size is 60M and the test set size is 10M.
47 PAPERS • 1 BENCHMARK
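A sketch of the generation procedure described above. Pixel-wise maximum is used as the compositing rule, which is an assumption since the summary does not specify the blending:

```python
import numpy as np

def make_multimnist(img_a: np.ndarray, img_b: np.ndarray, rng) -> np.ndarray:
    """Overlay two 28x28 digits of different classes on a 36x36 canvas,
    shifting each digit independently by up to 4 pixels per axis."""
    out = np.zeros((36, 36), dtype=np.float32)
    for img in (img_a, img_b):
        dx, dy = rng.integers(-4, 5, size=2)   # shifts in [-4, 4]
        x0, y0 = 4 + dx, 4 + dy                # offset 4 centres a 28x28 digit
        out[y0:y0 + 28, x0:x0 + 28] = np.maximum(out[y0:y0 + 28, x0:x0 + 28], img)
    return out

rng = np.random.default_rng(0)
pair = make_multimnist(np.zeros((28, 28)), np.ones((28, 28)), rng)
print(pair.shape)  # (36, 36)
```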
AI2 Diagrams (AI2D) is a dataset of over 5,000 grade-school science diagrams with over 150,000 rich annotations, their ground-truth syntactic parses, and more than 15,000 corresponding multiple-choice questions.
45 PAPERS • 1 BENCHMARK
Tiny-ImageNet-C is an open-source dataset comprising algorithmically generated corruptions (blur, noise) applied to the Tiny-ImageNet (ImageNet-200) test set.
44 PAPERS • NO BENCHMARKS YET
The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large variations in scale, pose, and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel-level trimap segmentation.
42 PAPERS • 7 BENCHMARKS
CARS196 is composed of 16,185 car images of 196 classes.
41 PAPERS • 4 BENCHMARKS
JFT-3B is an internal Google dataset and a larger version of the JFT-300M dataset. It consists of nearly 3 billion images, annotated with a class-hierarchy of around 30k labels via a semi-automatic pipeline. In other words, the data and associated labels are noisy.
37 PAPERS • NO BENCHMARKS YET
The SUN Attribute dataset consists of 14,340 images from 717 scene categories, and each category is annotated with a taxonomy of 102 discriminative attributes. The dataset can be used for high-level scene understanding and fine-grained scene recognition.
37 PAPERS • 2 BENCHMARKS
The Scene UNderstanding (SUN) database contains 899 categories and 130,519 images. There are 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition.
34 PAPERS • 5 BENCHMARKS
Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, it provides 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). Visual relationships between objects are annotated, supporting visual relationship detection, an emerging task that requires structured reasoning.
32 PAPERS • 1 BENCHMARK
Visual Wake Words represents a common microcontroller vision use-case of identifying whether a person is present in the image or not, and provides a realistic benchmark for tiny vision models.
30 PAPERS • 2 BENCHMARKS
Imagenet64 is a down-sampled version of ImageNet consisting of small (64×64) images. It comprises 1,281,167 training images and 50,000 test images with 1,000 labels.
29 PAPERS • 1 BENCHMARK
ImageNet-P consists of noise, blur, weather, and digital distortions. The dataset has validation perturbations; has difficulty levels; has CIFAR-10, Tiny ImageNet, ImageNet 64×64, standard, and Inception-sized editions; and is designed for benchmarking, not training, networks. ImageNet-P departs from ImageNet-C by having perturbation sequences generated from each ImageNet validation image. Each sequence contains more than 30 frames, so to counteract the increase in dataset size and evaluation time, only 10 common perturbations are used.
28 PAPERS • 1 BENCHMARK
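ImageNet-P is typically scored with flip-rate style metrics over each perturbation sequence; a minimal sketch of the consecutive-frame variant (in the original formulation, noise sequences are instead scored against the first frame):

```python
import numpy as np

def flip_probability(preds: np.ndarray) -> float:
    """preds: (num_sequences, num_frames) top-1 predictions over each
    perturbation sequence. Returns how often the prediction flips
    between consecutive frames."""
    return (preds[:, 1:] != preds[:, :-1]).mean()
```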
Omni-Realm Benchmark (OmniBenchmark) is a diverse (21 semantic realm-wise datasets) and concise (realm-wise datasets have no overlapping concepts) benchmark for evaluating pre-trained model generalization across semantic super-concepts/realms, e.g., from mammals to aircraft.
An object detection benchmark for logos.
26 PAPERS • 3 BENCHMARKS
Million-AID is a large-scale benchmark dataset containing a million instances for remote sensing (RS) scene classification. There are 51 semantic scene categories in Million-AID, customized to match land-use classification standards, which greatly enhances the practicability of the dataset. Unlike existing scene classification datasets, whose categories are organized with parallel or uncertain relationships, the scene categories in Million-AID are organized in a systematic relationship architecture, giving it superiority in management and scalability. Specifically, the scene categories are organized as a three-level hierarchical category tree: 51 leaf nodes fall into 28 parent nodes at the second level, which are grouped into 8 nodes at the first level, representing the 8 underlying scene categories: agriculture land, commercial land, industrial land, public service land, residential land, transportation land, unutilized land, and water area.
26 PAPERS • NO BENCHMARKS YET
The Behance Artistic Media dataset (BAM!) is a large-scale dataset of contemporary artwork from Behance, a website containing millions of portfolios from professional and commercial artists. We annotate Behance imagery with rich attribute labels for content, emotions, and artistic media. We believe our Behance Artistic Media dataset will be a good starting point for researchers wishing to study artistic imagery and relevant problems.
25 PAPERS • NO BENCHMARKS YET
Imagenette is a subset of 10 easily classified classes from Imagenet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).
24 PAPERS • 1 BENCHMARK
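fastai ships download URLs for Imagenette; a minimal loading sketch, assuming the standard train/val folder layout of the released archive:

```python
from fastai.vision.all import untar_data, URLs, ImageDataLoaders, Resize

# URLs.IMAGENETTE_160 is the 160px variant; IMAGENETTE and IMAGENETTE_320 also exist.
path = untar_data(URLs.IMAGENETTE_160)
dls = ImageDataLoaders.from_folder(path, valid="val", item_tfms=Resize(128))
print(dls.c, len(dls.train_ds))  # 10 classes
```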
The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection. It comprises 20 image classification datasets and 35 object detection datasets, each equipped with external knowledge.
23 PAPERS • 2 BENCHMARKS
The exact pre-processing steps used to construct the MNIST dataset have long been lost. This leaves us with no reliable way to associate its characters with the ID of the writer and little hope to recover the full MNIST testing set, which had 60K images but was never released. The official MNIST testing set only contains 10K randomly sampled images and is often considered too small to provide meaningful confidence intervals. The QMNIST dataset was generated from the original data found in the NIST Special Database 19 with the goal of matching the MNIST preprocessing as closely as possible. QMNIST is licensed under a BSD-style license.
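QMNIST is available in torchvision; a minimal sketch showing the reconstructed 60k test set and the extended targets that carry the recovered writer metadata:

```python
from torchvision.datasets import QMNIST

# `what` selects the split: 'test' is the full reconstructed 60k test set,
# 'test10k' matches the official MNIST 10k test set. With compat=False,
# each target is a vector of extended metadata (incl. NIST writer info)
# instead of a bare class index.
test_full = QMNIST(root="./data", what="test", compat=False, download=True)
img, target = test_full[0]
print(len(test_full))  # 60000
print(target)          # tensor with class label plus writer metadata
```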
Dataset with 28,792 retinal images from the EyePACS dataset, based on a three-level quality grading system (i.e., 'Good', 'Usable', and 'Reject') for evaluating RIQA methods.
22 PAPERS • NO BENCHMARKS YET
ImageNet-W(atermark) is a test set to evaluate models’ reliance on the newly found watermark shortcut in ImageNet, which is used to predict the carton class. ImageNet-W is created by overlaying transparent watermarks on the ImageNet validation set. Two metrics are used to evaluate watermark shortcut reliance: (1) IN-W Gap: the top-1 accuracy drop from ImageNet to ImageNet-W, (2) Carton Gap: carton class accuracy increase from ImageNet to ImageNet-W. Combining ImageNet-W with previous out-of-distribution variants of ImageNet (e.g., Stylized ImageNet, ImageNet-R, ImageNet-9) forms a comprehensive suite of multi-shortcut evaluation on ImageNet.
22 PAPERS • 1 BENCHMARK
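A minimal sketch of the two gap metrics from top-1 predictions. Here the Carton Gap is read as the increase in the fraction of images predicted as carton, and the class index is an assumption to verify against your label map:

```python
import numpy as np

CARTON = 478  # assumed ILSVRC-2012 index for 'carton'; verify before use

def shortcut_gaps(y: np.ndarray, pred_in: np.ndarray, pred_inw: np.ndarray):
    """y: ground-truth labels; pred_in / pred_inw: top-1 predictions on the
    clean validation set and its watermarked ImageNet-W counterpart."""
    inw_gap = (pred_in == y).mean() - (pred_inw == y).mean()
    carton_gap = (pred_inw == CARTON).mean() - (pred_in == CARTON).mean()
    return inw_gap, carton_gap
```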
Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform well on images from another domain. Complementing existing work in robustness testing, we introduce the first test dataset for this purpose that comes from an authentic use case, where photographers wanted to learn about the content of their images. We built a new test set using 8,900 images taken by people who are blind, for which we collected metadata to indicate the presence versus absence of 200 ImageNet object categories. We call this dataset VizWiz-Classification.
21 PAPERS • 3 BENCHMARKS
iNat2021 is a large-scale image dataset collected and annotated by community scientists that contains over 2.7M images from 10k different species.
20 PAPERS • 1 BENCHMARK
Includes 5,824 fundus images labeled with either positive glaucoma (2,392) or negative glaucoma (3,432).
18 PAPERS • 1 BENCHMARK
UrbanCars facilitates multi-shortcut learning in a controlled setting with two shortcuts: background and co-occurring object. The task is classifying the car body type into two categories: urban car and country car. The dataset contains three splits: training, validation, and testing. In the training set, the two shortcuts spuriously correlate with the car body type. Both the validation and testing sets are balanced, i.e., free of spurious correlations. The validation set is used for model selection, and the testing set evaluates the mitigation of the two shortcuts.
ETHOS is a hate speech detection dataset. It is built from YouTube and Reddit comments validated through a crowdsourcing platform. It has two subsets, one for binary classification and the other for multi-label classification. The former contains 998 comments, while the latter contains fine-grained hate-speech annotations for 433 comments.
17 PAPERS • 2 BENCHMARKS
IP102 contains more than 75,000 images belonging to 102 categories, which exhibit a natural long-tailed distribution.
17 PAPERS • 1 BENCHMARK
A composite dataset that unifies semantic segmentation datasets from different domains.
17 PAPERS • NO BENCHMARKS YET
The largest annotated image memorability dataset to date, with 60,000 labeled images from a diverse array of sources.
16 PAPERS • NO BENCHMARKS YET
The Neuromorphic-MNIST (N-MNIST) dataset is a spiking version of the original frame-based MNIST dataset. It consists of the same 60,000 training and 10,000 test samples as the original MNIST dataset, captured at the same visual scale as the original (28x28 pixels). The N-MNIST dataset was captured by mounting an ATIS event sensor on a motorized pan-tilt unit and moving the sensor while it viewed MNIST examples on an LCD monitor. A full description of the dataset and how it was created can be found in the associated paper, which should be cited by any work that uses the dataset.
13 PAPERS • 1 BENCHMARK
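The event streams can be loaded with the tonic library; a hedged sketch, assuming tonic's dataset and transform APIs, that bins events into dense frames:

```python
import tonic
import tonic.transforms as T

# Events are (x, y, t, p) tuples on a 34x34 sensor grid; ToFrame bins them
# into dense frame tensors (class/transform names assumed from tonic's API).
to_frame = T.ToFrame(sensor_size=tonic.datasets.NMNIST.sensor_size,
                     time_window=10_000)  # 10 ms bins, timestamps in microseconds
train = tonic.datasets.NMNIST(save_to="./data", train=True, transform=to_frame)
frames, label = train[0]
print(frames.shape)  # (num_bins, 2 polarities, 34, 34)
```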
The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with a medium level of traffic, travelling by tram, travelling by bus, travelling by underground metro, and urban park. Each acoustic scene has 1,440 segments (240 minutes of audio), and the dataset contains 40 hours of audio in total.
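A common front end for acoustic scene classification on such segments is log-mel energies; an illustrative sketch with librosa (the file name and parameters are placeholders, not the challenge baseline):

```python
import librosa

# Load one 10-second segment at its native sample rate.
y, sr = librosa.load("airport-barcelona-0-0-a.wav", sr=None)

# 40-band log-mel spectrogram (hypothetical frame settings).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=1024, n_mels=40)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (40 mel bands, num_frames)
```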
An update of 3DIdent, in which we introduce six additional object classes (Hare, Dragon, Cow, Armadillo, Horse, and Head) and impose a causal graph over the latent variables. For further details, see Appendix B of the associated paper (https://arxiv.org/abs/2106.04619).
12 PAPERS • 1 BENCHMARK
The Chaoyang dataset contains 1,111 normal, 842 serrated, 1,404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy dataset was collected in a real-world clinical scenario.
12 PAPERS • 2 BENCHMARKS
PlantDoc is a dataset for visual plant disease detection. The dataset contains 2,598 data points in total across 13 plant species and up to 17 classes of diseases, involving approximately 300 human hours of effort in annotating internet-scraped images.