Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background. (A minimal loading sketch follows this entry.)
3,335 PAPERS • 54 BENCHMARKS
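A minimal loading sketch using torchvision's Cityscapes wrapper; it assumes the official archives have already been downloaded and extracted to a local ./cityscapes directory, and the chosen splits and target types are illustrative rather than prescribed by the dataset.

```python
# Minimal sketch, assuming the Cityscapes archives are already extracted to ./cityscapes.
from torchvision.datasets import Cityscapes

# Fine annotations: ~5,000 images with semantic and instance label maps.
fine = Cityscapes("cityscapes", split="train", mode="fine",
                  target_type=["semantic", "instance"])

# Coarse annotations: ~20,000 additional images (train_extra split).
coarse = Cityscapes("cityscapes", split="train_extra", mode="coarse",
                    target_type="semantic")

img, (sem_mask, inst_mask) = fine[0]  # PIL image plus per-pixel label maps
```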
The CelebFaces Attributes (CelebA) dataset contains 202,599 face images of size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes such as hair color, gender, and age. (See the loading sketch after this entry.)
3,095 PAPERS • 20 BENCHMARKS
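As an illustration of the 40-attribute annotation, the snippet below loads CelebA through torchvision and reads the binary attribute vector for one image; the local root path is an assumption.

```python
# Minimal sketch, assuming a local "data" directory for the download.
from torchvision.datasets import CelebA

celeba = CelebA(root="data", split="train", target_type="attr", download=True)
img, attrs = celeba[0]        # 178x218 PIL image and a length-40 vector of 0/1 attribute labels
print(img.size, attrs.shape)  # (178, 218) torch.Size([40])
```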
The Omniglot data set is designed for developing more human-like learning algorithms. It contains 1,623 different handwritten characters from 50 different alphabets. Each of the 1,623 characters was drawn online via Amazon's Mechanical Turk by 20 different people. Each image is paired with stroke data, a sequence of [x, y, t] coordinates with time (t) in milliseconds. (A short sketch of this stroke format follows this entry.)
86 PAPERS • 15 BENCHMARKS
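To make the stroke format concrete, the sketch below models a single stroke as a list of (x, y, t) triples; the type alias and helper function are illustrative, not part of an official loader.

```python
from typing import List, Tuple

# One stroke is an ordered list of (x, y, t) points, with t in milliseconds.
Stroke = List[Tuple[float, float, float]]

def stroke_duration_ms(stroke: Stroke) -> float:
    """Elapsed drawing time of a single stroke."""
    return stroke[-1][2] - stroke[0][2] if stroke else 0.0

# Toy example: a stroke drawn from (10, 20) at t=0 ms to (15, 25) at t=120 ms.
example: Stroke = [(10.0, 20.0, 0.0), (15.0, 25.0, 120.0)]
print(stroke_duration_ms(example))  # 120.0
```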
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding. It contains 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry.
62 PAPERS • 1 BENCHMARK
Aff-Wild2 is an extension of the Aff-Wild dataset for affect recognition. It approximately doubles the number of included video frames and the number of subjects, thus improving the variability of the included behaviors and of the involved persons.
56 PAPERS • 1 BENCHMARK
The Wireframe dataset consists of 5,462 images (5,000 for training, 462 for test) of indoor and outdoor man-made scenes.
53 PAPERS • 2 BENCHMARKS
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and 50,574 images annotated with event class labels.
20 PAPERS • 1 BENCHMARK
SemArt is a multi-modal dataset for semantic art understanding. It is a collection of fine-art painting images in which each image is associated with a number of attributes and a textual artistic comment, such as those that appear in art catalogues or museum collections. It contains 21,384 samples, each providing an artistic comment along with a fine-art painting and its attributes, for studying semantic art understanding.
13 PAPERS • NO BENCHMARKS YET
WWW Crowd provides 10,000 videos with over 8 million frames from 8,257 diverse scenes, thereby offering a comprehensive dataset for crowd understanding.
5 PAPERS • NO BENCHMARKS YET
The CropAndWeed dataset is focused on the fine-grained identification of 74 relevant crop and weed species with a strong emphasis on data variability. Annotations of labeled bounding boxes, semantic masks and stem positions are provided for about 112k instances in more than 8k high-resolution images of both real-world agricultural sites and specifically cultivated outdoor plots of rare weed types. Additionally, each sample is enriched with meta-annotations regarding environmental conditions.
4 PAPERS • NO BENCHMARKS YET
An annotated dataset released to enable dynamic scene classification, comprising 80 hours of diverse, high-quality driving video clips collected in the San Francisco Bay Area. The dataset includes temporal annotations for road places, road types, weather, and road surface conditions.
PGDP5K is a dataset of 5,000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types, and 6 text types. The diagrams are labeled with fine-grained annotations at the primitive level, including primitive classes, locations, and relationships. Of the images, 1,813 non-duplicated ones are selected from the Geometry3K dataset and the other 3,187 are collected from three popular textbooks across grades 6-12, found on mathematics curriculum websites, by taking screenshots from the PDF books.
4 PAPERS • 1 BENCHMARK
The Cifar10Mnist dataset is created from the CIFAR-10 and MNIST data sources. Since the CIFAR-10 training set consists of 50,000 images and the MNIST training set contains 60,000 digits, the first 50,000 MNIST digits are overlaid on top of the CIFAR-10 images after being made slightly translucent, yielding a first training dataset of 50,000 images. The remaining 10,000 MNIST digits are then overlaid on top of 10,000 random CIFAR-10 images (selected with a fixed seed), which makes a second training dataset of 60,000 images possible. For the test set, the 10,000 CIFAR-10 test images are overlaid over the 10,000 MNIST test digits. (A construction sketch follows this entry.)
1 PAPER • NO BENCHMARKS YET
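A minimal sketch of how the overlay described above could be reproduced with torchvision and PIL; the translucency factor, the digit resizing, and the random seed are assumptions rather than the official construction.

```python
import numpy as np
from PIL import Image
from torchvision.datasets import CIFAR10, MNIST

ALPHA = 0.6  # assumed translucency of the MNIST digit pasted onto CIFAR-10

def overlay(cifar_img: Image.Image, mnist_img: Image.Image) -> Image.Image:
    """Paste a slightly translucent MNIST digit on top of a CIFAR-10 image."""
    base = cifar_img.convert("RGB")
    digit = mnist_img.convert("L").resize(base.size)      # 28x28 -> 32x32 (assumed)
    mask = digit.point(lambda p: int(p * ALPHA))          # translucent blend mask
    digit_rgb = Image.merge("RGB", (digit, digit, digit))
    return Image.composite(digit_rgb, base, mask)

cifar_train = CIFAR10(root="data", train=True, download=True)
mnist_train = MNIST(root="data", train=True, download=True)

# First training set: the first 50,000 MNIST digits over the 50,000 CIFAR-10 images.
first_train = [overlay(cifar_train[i][0], mnist_train[i][0]) for i in range(50_000)]

# Second training set: the remaining 10,000 digits over 10,000 random CIFAR-10 images
# drawn with a fixed seed (the seed value itself is an assumption), 60,000 images total.
rng = np.random.default_rng(0)
extra_idx = rng.choice(50_000, size=10_000, replace=False)
second_train = first_train + [
    overlay(cifar_train[int(j)][0], mnist_train[50_000 + k][0])
    for k, j in enumerate(extra_idx)
]
```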
ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims, encompassing 4,680 entities present in 3K memes.
The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits the potential of its use in training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time-series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1 and Sentinel-2. The dataset covers multi-spectral, thermal and microwave sensors over the January 2018 - March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India, and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3 m, 10 m and 30 m). Our dataset comprises 2,370 season-wise samples from 388 unique plots, having
1 PAPER • 5 BENCHMARKS
The pic2kal benchmark for calorie prediction contains 308,000 images from over 70,000 recipes including photographs, ingredients and instructions, matched with nutritional information.
The CVGL Camera Calibration Dataset consists of 49 camera configurations, with Town 1 having 25 configurations and Town 2 having 24. The parameters varied to generate the configurations are fov, x, y, z, pitch, yaw, and roll, where fov is the field of view, (x, y, z) is the translation, and (pitch, yaw, roll) is the rotation between the cameras. The total number of image pairs is 79,320, of which 18,083 belong to Town 1 and 61,237 to Town 2; the difference in the number of images is due to the length of the tracks. (A configuration sketch follows this entry.)
0 PAPER • NO BENCHMARKS YET
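For concreteness, one possible in-code representation of a single camera configuration, with fields mirroring the parameters listed above; the class name, field layout, and example values are purely illustrative assumptions, not the dataset's file format.

```python
from dataclasses import dataclass

@dataclass
class CameraConfig:
    """One CVGL camera configuration (illustrative; field layout is an assumption)."""
    fov: float    # field of view, in degrees
    x: float      # translation between the cameras
    y: float
    z: float
    pitch: float  # rotation between the cameras
    yaw: float
    roll: float

# Example configuration (values are made up for illustration only).
cfg = CameraConfig(fov=90.0, x=0.5, y=0.0, z=1.6, pitch=0.0, yaw=15.0, roll=0.0)
```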
A large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.