Tiny ImageNet contains 100000 images of 200 classes (500 for each class) downsized to 64×64 colored images. Each class has 500 training images, 50 validation images and 50 test images.
952 PAPERS • 8 BENCHMARKS
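The Tiny ImageNet split sizes follow directly from the per-class counts given in its description. The sketch below only does that arithmetic; the variable names are illustrative assumptions and no dataset is loaded. It suggests that the 100000 figure corresponds to the training split.

```python
# Per-class image counts as stated in the Tiny ImageNet description.
NUM_CLASSES = 200
PER_CLASS = {"train": 500, "val": 50, "test": 50}

# Total images per split = number of classes x images per class.
split_totals = {split: NUM_CLASSES * n for split, n in PER_CLASS.items()}

for split, total in split_totals.items():
    print(f"{split}: {total} images")
```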
EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 x 28 format.
234 PAPERS • 9 BENCHMARKS
The PlantVillage dataset consists of 54303 healthy and unhealthy leaf images divided into 38 categories by species and disease.
62 PAPERS • 1 BENCHMARK
Omni-Realm Benchmark (OmniBenchmark) is a diverse (21 semantic realm-wise datasets) and concise (realm-wise datasets have no concepts overlapping) benchmark for evaluating pre-trained model generalization across semantic super-concepts/realms, e.g. across mammals to aircraft.
28 PAPERS • 1 BENCHMARK
Imagenette is a subset of 10 easily classified classes from ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).
24 PAPERS • 1 BENCHMARK
The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection. ELEVATER consists of:
22 PAPERS • 2 BENCHMARKS
The Chaoyang dataset contains 1111 normal, 842 serrated, 1404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy dataset was constructed in a real-world scenario.
12 PAPERS • 2 BENCHMARKS
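The per-class counts listed for Chaoyang can be tallied to see the split sizes and the class balance. This is a minimal sketch using only the numbers from the description; the helper name `class_shares` is hypothetical.

```python
# Sample counts per class, copied from the Chaoyang description.
train_counts = {"normal": 1111, "serrated": 842, "adenocarcinoma": 1404, "adenoma": 664}
test_counts = {"normal": 705, "serrated": 321, "adenocarcinoma": 840, "adenoma": 273}


def class_shares(counts):
    """Return each class's fraction of the split total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


train_total = sum(train_counts.values())
test_total = sum(test_counts.values())
train_shares = class_shares(train_counts)

print(f"train: {train_total} samples, test: {test_total} samples")
for name, share in train_shares.items():
    print(f"  {name}: {share:.1%} of training samples")
```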
The HyperKvasir dataset contains 110,079 images and 374 videos that capture anatomical landmarks as well as pathological and normal findings, for a total of around 1 million images and video frames.
10 PAPERS • 2 BENCHMARKS
We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced, while most previous artwork datasets suffer from long-tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32×32, 256×256, and original image size), formatted so that they are easy to incorporate into popular machine learning frameworks.
7 PAPERS • 1 BENCHMARK
Approx. 300,000 images of galaxies labelled by shape.
5 PAPERS • NO BENCHMARKS YET
CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, depending on whether the digit in $x$ is even or odd. The dataset defines the background color as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.
4 PAPERS • NO BENCHMARKS YET
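The CI-MNIST eligibility rule can be written as a small function. This is a sketch of the rule as described (even digit gives $y = 1$, odd digit gives $y = 0$), not the dataset's actual generation code. The function name and the integer encoding of the background colours are assumptions.

```python
def eligibility_label(digit: int) -> int:
    """Return y = 1 (eligible) for an even digit and y = 0 (ineligible) for an odd digit."""
    if not 0 <= digit <= 9:
        raise ValueError("MNIST digits are in the range 0..9")
    return 1 if digit % 2 == 0 else 0


# Sensitive attribute s derived from the background colour.
# The description gives s in {0, 1} with blue = unprivileged and red = privileged;
# which integer is paired with which colour is an assumption made here.
BACKGROUND_TO_S = {"blue": 0, "red": 1}

labels = [eligibility_label(d) for d in range(10)]
print(labels)
```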
The ETHEC dataset includes 47,978 butterfly images with a 4-level label hierarchy: family, sub-family, genus, and species. The hierarchy comprises 6 families -> 21 sub-families -> 135 genera -> 561 species.
InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicle (UAV) colour images. It contains 17 unique power line assets captured from real-world operating power lines. Some of those assets (five, to be precise) are also annotated regarding their conditions. They present the following defects: corrosion (4 of them), broken/missing cap (1 of them), and bird's nest presence (1 of them).
4 PAPERS • 1 BENCHMARK
The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset. It comprises 11276 images from 564 patients and 1043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).
The MNIST Large Scale dataset is based on the classic MNIST dataset, but contains large scale variations up to a factor of 16. The motivation behind creating this dataset was to enable testing the ability of different algorithms to learn in the presence of large scale variability and specifically the ability to generalise to new scales not present in the training set over wide scale ranges.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model.
Sewer-ML is a sewer defect dataset. It contains 1.3 million images, from 75,618 videos collected from three Danish water utility companies over nine years. All videos have been annotated by licensed sewer inspectors following the Danish sewer inspection standard, Fotomanualen. This leads to consistent and reliable annotations, and a total of 17 annotated defect classes.
A games dataset containing 100,000 gameplay images of 175 video games across 10 sports genres, including American football, basketball, bike racing, car racing, fighting, hockey, soccer, table tennis, and tennis.
4 PAPERS • 2 BENCHMARKS
This is the first image-based network intrusion detection dataset. This large-scale dataset includes network traffic protocol communication-based images from 15 observation locations in different countries in Asia. The dataset is used to identify two different types of anomalies from benign network traffic. Each image, with a size of 48 × 48, contains multi-protocol communications within 128 seconds. The SIDD dataset can be applied to a broad range of tasks such as machine learning-based network intrusion detection, non-iid federated learning, and so forth.
3 PAPERS • 1 BENCHMARK
The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.
2 PAPERS • 1 BENCHMARK
DeepPCB
A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying images from the GI tract into 23 distinct classes. The second task focuses on efficient classification, measured by the amount of time spent processing each image. The last task relates to automatically segmenting polyps.
ImageNet-Hard is a new benchmark that comprises 10,980 images collected from various existing ImageNet-scale benchmarks (ImageNet, ImageNet-V2, ImageNet-Sketch, ImageNet-C, ImageNet-R, ImageNet-ReaL, ImageNet-A, and ObjectNet). This dataset poses a significant challenge to state-of-the-art vision models as merely zooming in often fails to improve their ability to classify images correctly. As a result, even the most advanced models, such as CLIP-ViT-L/14@336px, struggle to perform well on this dataset, achieving a mere 2.02% accuracy.
The Kvasir-Capsule dataset is the largest publicly released video capsule endoscopy (VCE) dataset. In total, it contains 47,238 labeled images and 117 videos that capture anatomical landmarks as well as pathological and normal findings. The result is more than 4,741,621 images and video frames altogether.
2 PAPERS • NO BENCHMARKS YET
ArtDL is a novel painting data set for iconography classification composed of images collected from online sources. Most of the paintings are from the Renaissance period and depict scenes or characters of Christian art. The data set is annotated with classes representing specific characters belonging to the Iconclass classification system.
1 PAPER • 1 BENCHMARK
A dataset of 241 Chinese dishes with 191,811 images. There are 170,843 images in the training set and 20,943 images in the validation set. All images are resized to 600x600. Because some of the images come from ChineseFoodNet, commercial use is not supported.
1 PAPER • NO BENCHMARKS YET
CNFOOD-241 is a dataset of 241 Chinese dishes with 191,811 images. There are 170,843 images in the training set and 20,943 images in the validation set. All images are resized to 600x600. Because some of the images come from ChineseFoodNet, commercial use is not supported. CNFOOD-241-Chen is the CNFOOD-241 dataset split with the list introduced in the paper "Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning," which randomly splits the data into three parts: train, validation, and test.
The standard evaluation protocol of Cross-View Time dataset allows for certain cameras to be shared between training and testing sets. This protocol can emulate scenarios in which we need to verify the authenticity of images from a particular set of devices and locations. Considering the ubiquity of surveillance systems (CCTV) nowadays, this is a common scenario, especially for big cities and high visibility events (e.g., protests, musical concerts, terrorist attempts, sports events). In such cases, we can leverage the availability of historical photographs of that device and collect additional images from previous days, months, and years. This would allow the model to better capture the particularities of how time influences the appearance of that specific place, probably leading to a better verification accuracy. However, there might be cases in which data is originated from heterogeneous sources, such as social media. In this sense, it is essential that models are optimized on camer
ISBNet is a dataset of images of recyclables, hand-collected by our group at the International School of Beijing. The trash in these images was gathered from trash bins around the school. ISBNet totals 889 images distributed across 5 classes: cans (74), landfill (410), paper (182), plastic (122), and tetra pak (101). The data acquisition process used a piece of black poster paper as a background to create enough contrast for trash belonging to the paper category. The pictures were taken with an iPhone 8 and an iPhone XS. We recorded the trash bin from which each piece of trash originated and any trash-generating landmarks nearby. Please refer to the paper (ThanosNet: A Novel Trash Classification Method Using Metadata) for more about the format of the metadata.
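The ISBNet class counts are noticeably imbalanced, which a short tally makes concrete. The sketch below sums the stated counts and derives inverse-frequency class weights. The weighting formula is a common heuristic assumed here for illustration and is not described in the source.

```python
# Image counts per class, copied from the ISBNet description.
class_counts = {"cans": 74, "landfill": 410, "paper": 182, "plastic": 122, "tetra pak": 101}

total = sum(class_counts.values())
num_classes = len(class_counts)

# Inverse-frequency weights (assumed heuristic): total / (num_classes * count).
# Rare classes get weights above 1, frequent classes get weights below 1.
class_weights = {name: total / (num_classes * n) for name, n in class_counts.items()}

print(f"total images: {total}")
for name, n in class_counts.items():
    print(f"  {name}: {n} images ({n / total:.1%}), weight {class_weights[name]:.2f}")
```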
There was no predefined dataset of party symbols to be used as a benchmark. We curated a dataset from various national and regional websites owned by the ECI. The dataset consists of symbols (image files) of 49 National and State registered parties approved by the ECI. For each image of the original party symbol, 18 different distortions and transformations were created as variations to the training data. Each image is of the dimension 180 x 180. The final labeled dataset consists of 931 images of party symbols with their corresponding party names as the labels.
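The final count of 931 images is consistent with one original symbol plus 18 variants for each of the 49 parties. The sketch below checks that arithmetic, assuming exactly one original image per party; the constant names are illustrative.

```python
NUM_PARTIES = 49             # registered parties, per the description
DISTORTIONS_PER_SYMBOL = 18  # distorted or transformed variants per original symbol

# Assumption: each party contributes one original image plus its variants.
images_per_party = 1 + DISTORTIONS_PER_SYMBOL
total_images = NUM_PARTIES * images_per_party

print(f"{images_per_party} images per party, {total_images} images in total")
```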
This dataset is recreated using offline augmentation from the original dataset. The original dataset can be found on this GitHub repo. The dataset consists of about 87K RGB images of healthy and diseased crop leaves, which are categorized into 38 different classes. The total dataset is divided in an 80/20 ratio into training and validation sets, preserving the directory structure. A new directory containing 33 test images was created later for prediction purposes.
The AASL-Clear dataset is a collection of RGB images featuring Arabic Alphabet Sign Language gestures with backgrounds removed. Each image in this dataset showcases clear, isolated hand gestures, allowing for precise recognition and analysis of Arabic sign language alphabets. With transparent backgrounds, this dataset provides a clean and focused resource for training deep learning models in the domain of Arabic sign language recognition and classification.
Our proposed Synthetic-to-Real benchmark for more practical visual DA (termed S2RDA) includes two challenging transfer tasks of S2RDA-49 and S2RDA-MS-39. In each task, source/synthetic domain samples are synthesized by rendering 3D models from ShapeNet. The used 3D models are in the same label space as the target/real domain and each class has 12K rendered RGB images. The real domain of S2RDA-49 comprises 60,535 images of 49 classes, collected from ImageNet validation set, ObjectNet, VisDA-2017 validation set, and the web. For S2RDA-MS-39, the real domain collects 41,735 natural images exclusive for 39 classes from MetaShift, which contain complex and distinct contexts, e.g., object presence (co-occurrence of different objects), general contexts (indoor or outdoor), and object attributes (color or shape), leading to a much harder task. Compared to VisDA-2017, our S2RDA contains more categories, more realistically synthesized source domain data coming for free, and more complicated targ
The social vision and language dataset is a large-scale multimodal dataset designed for research into social contextual learning.
A public open dataset of synthetic chest X-ray images of COVID-19.
Tsinghua Dogs is a fine-grained classification dataset for dogs, over 65% of whose images are collected from people's real life. Each dog breed in the dataset contains at least 200 images and at most 7,449 images, roughly in proportion to the breed's frequency of occurrence in China, so it significantly increases the diversity for each breed over existing datasets. Furthermore, Tsinghua Dogs annotates bounding boxes of the dog’s whole body and head in each image, which can be used for supervising the training of learning algorithms as well as testing them.
topex-printer is a dataset containing 102 machine parts of a label printing machine. It includes these parts for two domains, real photos and CAD rendered models.
ADFI Dataset is an image dataset for anomaly detection methods with a focus on industrial inspection. Each category sub dataset comprises a training set of images and a test set of images with various kinds of defects as well as images without defects.
0 PAPER • NO BENCHMARKS YET
The dataset contains images of apples infected by scab. The images are grouped in two folders: "Healthy" and "Scab". The digital images were collected in different locations in Latvia. Digital images with characteristic scab symptoms on fruits were collected by the Institute of Horticulture (LatHort) under project "lzp-2019/1-0094 Application of deep learning and datamining for the study of plant-pathogen interaction: the case of apple and pear scab" with the goal of creating a mobile application for apple scab detection using convolutional neural networks. Devices: smartphone cameras (12 MP, 13 MP, 48 MP) and a digital compact camera (10 MP). The images were collected in field conditions, in orchards. The images were taken at three different times of day: in the morning (9:00-10:00), around noon (12:00-14:00), and in the evening (16:00-17:00), to provide a variety of natural light conditions. The images were also taken on both sunny days and overcast d
The dataset contains images of apple leaves infected by scab. The images are grouped in two folders: "Healthy" and "Scab". The digital images were collected in different locations in Latvia. Digital images with characteristic scab symptoms on leaves were collected by the Institute of Horticulture (LatHort) under project "lzp-2019/1-0094 Application of deep learning and datamining for the study of plant-pathogen interaction: the case of apple and pear scab" with the goal of creating a mobile application for apple scab detection using convolutional neural networks. Devices: smartphone cameras (12 MP, 13 MP, 48 MP) and a digital compact camera (10 MP). The images were collected in field conditions, in orchards. The images were taken at three different times of day: in the morning (9:00-10:00), around noon (12:00-14:00), and in the evening (16:00-17:00), to provide a variety of natural light conditions. The images were also taken on both sunny days and over
This dataset contains images of corn seeds, with the top and bottom views treated independently (two images per corn seed: top and bottom). There are four classes of corn seed (Broken-B, Discolored-D, Silkcut-S, and Pure-P). 17802 images were labeled by experts at the AdTech Corp., and 26K images were unlabeled, of which 9k were labeled using Active Learning (BatchBALD).
Mudestreda is a Multimodal Device State Recognition Dataset obtained from a real industrial milling device, with Time Series and Image Data for Classification, Regression, Anomaly Detection, Remaining Useful Life (RUL) estimation, Signal Drift measurement, Zero Shot Flank Tool Wear, and Feature Engineering purposes.