🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language (clear)

790 dataset results for Images AND English

The Sunnybrook Cardiac Data (SCD), also known as the 2009 Cardiac MR Left Ventricle Segmentation Challenge data, consist of 45 cine-MRI images from a mixed of patients and pathologies: healthy, hypertrophy, heart failure with infarction and heart failure without infarction. Subset of this data set was first used in the automated myocardium segmentation challenge from short-axis MRI, held by a MICCAI workshop in 2009. The whole complete data set is now available in the CAP database with public domain license.

5 PAPERS • NO BENCHMARKS YET

TRANCE (Transformation Driven Visual Reasoning)

TRANCE extends CLEVR by asking a uniform question, i.e. what is the transformation between two given images, to test the ability of transformation reasoning. TRANCE includes three levels of settings, i.e. Basic (single-step transformation), Event (multi-step transformation), and View (multi-step transformation with variant views). Detailed information can be found in https://hongxin2019.github.io/TVR.

5 PAPERS • NO BENCHMARKS YET

UMVM

We present a further analysis of visual modality incompleteness, benchmarking latest MMEA models on our proposed dataset MMEA-UMVM.

5 PAPERS • 7 BENCHMARKS

VideoCube

VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT is a high-quality and multi-modal benchmark based on VideoCube-Tiny to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content.

5 PAPERS • NO BENCHMARKS YET

XImageNet-12 (XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation)

Enlarge the dataset to understand how image background effect the Computer Vision ML model. With the following topics: Blur Background / Segmented Background / AI generated Background/ Bias of tools during annotation/ Color in Background / Dependent Factor in Background/ LatenSpace Distance of Foreground/ Random Background with Real Environment!

5 PAPERS • 1 BENCHMARK

ZS-F-VQA

The ZS-F-VQA dataset is a new split of the F-VQA dataset for zero-shot problem. Firstly we obtain the original train/test split of F-VQA dataset and combine them together to filter out the triples whose answers appear in top-500 according to its occurrence frequency. Next, we randomly divide this set of answers into new training split (a.k.a. seen) $\mathcal{A}_s$ and testing split (a.k.a. unseen) $\mathcal{A}_u$ at the ratio of 1:1. With reference to F-VQA standard dataset, the division process is repeated 5 times. For each $(i,q,a)$ triplet in original F-VQA dataset, it is divided into training set if $a \in \mathcal{A}_s$. Else it is divided into testing set. The overlap of answer instance between training and testing set in F-VQA are $2565$ compared to $0$ in ZS-F-VQA.

5 PAPERS • 1 BENCHMARK

ArtiFact (Artificial and Factual Image Dataset for Synthetic Image Detection)

The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.

4 PAPERS • NO BENCHMARKS YET

AutoChart

AutoChart is a dataset for chart-to-text generation, a task that consists on generating analytical descriptions of visual plots.

4 PAPERS • NO BENCHMARKS YET

Bentham (Bentham project)

Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6 000 documents or > 25 000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset a part of the Bentham manuscripts.

4 PAPERS • 1 BENCHMARK

CHOCOLATE (Captions Have Often ChOsen Lies About The Evidence)

CHOCOLATE is a benchmark for detecting and correcting factual inconsistency in generated chart captions. It consists of captions produced by six advanced models, which are categorized into three subsets:

4 PAPERS • 4 BENCHMARKS

CI-MNIST (Correlated and Imbalanced MNIST) is a variant of MNIST dataset with introduced different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, given that $x$ is even or odd. The dataset defines the background colors as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed in order to evaluate bias-mitigation approaches in challenging setups and be capable of controlling different dataset configurations.

4 PAPERS • NO BENCHMARKS YET

ChangeSim

ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation environments with the presence of environmental non-targeted variations, such as air turbidity and light condition changes, as well as targeted object changes in industrial indoor environments. By collecting data in simulations, multi-modal sensor data and precise ground truth labels are obtainable such as the RGB image, depth image, semantic segmentation, change segmentation, camera poses, and 3D reconstructions. While the previous online SCD datasets evaluate models given well-aligned image pairs, ChangeSim also provides raw unpaired sequences that present an opportunity to develop an online SCD model in an end-to-end manner, considering both pairing and detection. Experiments show that even the latest pair-based SCD models suffer from the bottleneck of the pairing process, and it gets worse when the environment contains the non-targeted variations.

4 PAPERS • 2 BENCHMARKS

ChineseLP

The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.

4 PAPERS • 1 BENCHMARK

CropAndWeed Dataset

The CropAndWeed dataset is focused on the fine-grained identification of 74 relevant crop and weed species with a strong emphasis on data variability. Annotations of labeled bounding boxes, semantic masks and stem positions are provided for about 112k instances in more than 8k high-resolution images of both real-world agricultural sites and specifically cultivated outdoor plots of rare weed types. Additionally, each sample is enriched with meta-annotations regarding environmental conditions.

4 PAPERS • NO BENCHMARKS YET

DiaMOS Plant (A Dataset for Diagnosis and Monitoring Plant Disease)

Abstract The classification and recognition of foliar diseases is an increasingly developing field of research, where the concepts of machine and deep learning are used to support agricultural stakeholders. Datasets are the fuel for the development of these technologies. In this paper, we release and make publicly available the field dataset collected to diagnose and monitor plant symptoms, called DiaMOS Plant, consisting of 3505 images of pear fruit and leaves affected by four diseases. In addition, we perform a comparative analysis of existing literature datasets designed for the classification and recognition of leaf diseases, highlighting the main features that maximize the value and information content of the collected data. This study provides guidelines that will be useful to the research community in the context of the selection and construction of datasets.

4 PAPERS • NO BENCHMARKS YET

DocCVQA (Document Collection Visual Question Answering)

DocCVQA is a Document Visual Question Answering dataset, where the questions are posed over a whole collection of 14,362 scanned documents. Therefore, the task can be seen as a retrieval-style evidence seeking task where given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering this question as well as provide the answer.

4 PAPERS • NO BENCHMARKS YET

ETHEC (ETH Entomological Collection (ETHEC) Dataset)

It includes 47,978 butterfly images with a 4-level label-hierarchy. Hierarchy of labels from the ETHEC dataset across 4 levels: family, sub-family, genus and species. 6 family -> 21 sub-family -> 135 genus -> 561 species

4 PAPERS • NO BENCHMARKS YET

FewSOL (A Dataset for Few-Shot Object Learning in Robotic Environments)

The Few-Shot Object Learning (FewSOL) dataset can be used for object recognition with a few images per object. It contains 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition.

4 PAPERS • NO BENCHMARKS YET

InsPLAD (Inspection Power Line Asset Dataset)

InsPLAD is a Dataset for Power Line Asset Inspection containing 10,607 high-resolution Unmanned Aerial Vehicles colour images. It contains 17 unique power line assets captured from real-world operating power lines. Some of those assets (five, to be precise) are also annotated regarding their conditions. They present the following defects: corrosion (4 of them), broken/missing cap (1 of them), and bird's nest presence (1 of them).

4 PAPERS • 1 BENCHMARK

Kvasir-Sessile dataset (Sessile polyps from Kvasir-SEG)

The Kvasir-SEG dataset includes 196 polyps smaller than 10 mm classified as Paris class 1 sessile or Paris class IIa. We have selected it with the help of expert gastroenterologists. We have released this dataset separately as a subset of Kvasir-SEG. We call this subset Kvasir-Sessile.

4 PAPERS • 1 BENCHMARK

LIMUC (Labeled Images for Ulcerative Colitis)

The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset that compromises 11276 images from 564 patients and 1043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).

4 PAPERS • 1 BENCHMARK

MNIST Large Scale dataset

The MNIST Large Scale dataset is based on the classic MNIST dataset, but contains large scale variations up to a factor of 16. The motivation behind creating this dataset was to enable testing the ability of different algorithms to learn in the presence of large scale variability and specifically the ability to generalise to new scales not present in the training set over wide scale ranges.

4 PAPERS • 1 BENCHMARK

MP-DocVQA (Multipage Document Visual Question Answering)

The dataset is aimed to perform Visual Question Answering on multipage industry scanned documents. The questions and answers are reused from Single Page DocVQA (SP-DocVQA) dataset. The images also corresponds to the same in original dataset with previous and posterior pages with a limit of up to 20 pages per document.

4 PAPERS • NO BENCHMARKS YET

Mocheg

A large-scale dataset that consists of 21,184 claims, where each claim is assigned a truthfulness label and ruling statement, with 58,523 pieces of evidence in the form of text and images. It supports the end-to-end multimodal fact-checking and explanation generation, where the input is a claim and a large collection of web sources, including articles, images, videos, and tweets, and the goal is to assess the truthfulness of the claim by retrieving relevant evidence and predicting a truthfulness label (i.e., support, refute and not enough information), and generate a rationalization statement to explain the reasoning and ruling process.

4 PAPERS • NO BENCHMARKS YET

MuMiN

MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.

4 PAPERS • 3 BENCHMARKS

Multi-Label Classification Dataset Repository

For each dataset we provide a short description as well as some characterization metrics. It includes the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep) and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels, the greater avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the used partition methods are described in

4 PAPERS • NO BENCHMARKS YET

MultiSubs

MultiSubs (MultiSubs: A Large-scale Multimodal and Multilingual Dataset)

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. Multisubs will benefit research on visual grounding of words especially in the context of free-form sentence.

4 PAPERS • 5 BENCHMARKS

PGDP5K (Plane Geometry Diagram Parsing Dataset)

PGDP5K is a dataset consisting of 5000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types and 6 text types, labeled with more fine-grained annotations at primitive level, including primitive classes, locations and relationships, where 1,813 non-duplicated images are selected from the Geometry3K dataset and other 3,187 images are collected from three popular textbooks across grades 6-12 on mathematics curriculum websites by taking screenshots from PDF books.

4 PAPERS • 1 BENCHMARK

PIQ23

Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit

4 PAPERS • NO BENCHMARKS YET

PSI

PSI (IUPUI-CSRC Pedestrian Situated Intent)

The IUPUI-CSRC Pedestrian Situated Intent (PSI) benchmark dataset has two innovative labels besides comprehensive computer vision annotations. The first novel label is the dynamic intent changes for the pedestrians to cross in front of the ego-vehicle, achieved from 24 drivers with diverse backgrounds. The second one is the text-based explanations of the driver reasoning process when estimating pedestrian intents and predicting their behaviors during the interaction period.

4 PAPERS • NO BENCHMARKS YET

QUILT-1M

Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models), handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets, from other sources, including Twitter, research papers, and the internet in general, to create an even larger dat

4 PAPERS • NO BENCHMARKS YET

RF100 (Roboflow 100)

The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model.

4 PAPERS • 1 BENCHMARK

RITE (Retinal Images vessel Tree Extraction)

The RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries and veins on retinal fundus images, which is established based on the public available DRIVE database (Digital Retinal Images for Vessel Extraction).

4 PAPERS • 2 BENCHMARKS

RealCQA

RealCQA Scientific Chart Question Answering as a Test-bed for First-Order Logic

4 PAPERS • 1 BENCHMARK

Sewer-ML

Sewer-ML is a sewer defect dataset. It contains 1.3 million images, from 75,618 videos collected from three Danish water utility companies over nine years. All videos have been annotated by licensed sewer inspectors following the Danish sewer inspection standard, Fotomanualen. This leads to consistent and reliable annotations, and a total of 17 annotated defect classes.

4 PAPERS • NO BENCHMARKS YET

SpaceNet 1 (SpaceNet 1: Building Detection v1)

SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, covering an area of 2,544 sq. km of 3/8 band WorldView-2 imagery (0.5 m pixel res.) across the city of Rio de Janeiro, Brazil. The images are processed as 200m×200m tiles with associated building footprint vectors for training.

4 PAPERS • 2 BENCHMARKS

Sports10

Games dataset containing 100,000 Gameplay Images of 175 Video Games across 10 Sports Genres - AMERICAN FOOTBALL, BASKETBALL, BIKE RACING, CAR RACING, FIGHTING, HOCKEY, SOCCER, TABLE TENNIS, TENNIS.

4 PAPERS • 2 BENCHMARKS

Twitter100k

Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.

4 PAPERS • NO BENCHMARKS YET

VEDAI

VEDAI (Vehicle Detection in Aerial Imagery)

VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms in unconstrained environments. The vehicles contained in the database, in addition of being small, exhibit different variabilities such as multiple orientations, lighting/shadowing changes, specularities or occlusions. Furthermore, each image is available in several spectral bands and resolutions. A precise experimental protocol is also given, ensuring that the experimental results obtained by different people can be properly reproduced and compared. We also give the performance of some baseline algorithms on this dataset, for different settings of these algorithms, to illustrate the difficulties of the task and provide baseline comparisons.

4 PAPERS • 1 BENCHMARK

VizWiz-VQA-Grounding

The VizWiz-VQA-Grounding dataset is a dataset that visually grounds answers to visual questions asked by people with visual impairments.

4 PAPERS • NO BENCHMARKS YET

WinoGAViL

This dataset is collected via the WinoGAViL game to collect challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.

4 PAPERS • 2 BENCHMARKS

YCBInEOAT Dataset

A new dataset with significant occlusions related to object manipulation.

4 PAPERS • NO BENCHMARKS YET

Zenseact Open Dataset

The Zenseact Open Dataset (ZOD) is a large-scale and diverse multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European counties, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.

4 PAPERS • NO BENCHMARKS YET

AdobeVFR real

AdobeVFR real (Adobe Visual Font Recognition real-world images dataset)

Subset of AdobeVFR. The dataset contains "real-world text images".

3 PAPERS • 1 BENCHMARK

AdobeVFR syn

AdobeVFR syn (Adobe Visual Font Recognition synthetic dataset)

Subset of AdobeVFR. The dataset contains images depicting English text and consists of 1000 synthetic images for training and 100 for testing, for each of 2383 font classes. The training and test sets are called VFR_syn_train and VFR_syn_val, respectively.

3 PAPERS • 1 BENCHMARK

Aircraft Context Dataset

The Aircraft Context Dataset, a composition of two inter-compatible large-scale and versatile image datasets focusing on manned aircraft and UAVs, is intended for training and evaluating classification, detection and segmentation models in aerial domains. Additionally, a set of relevant meta-parameters can be used to quantify dataset variability as well as the impact of environmental conditions on model performance.

3 PAPERS • NO BENCHMARKS YET

AnthroProtect

For a detailed description, we refer to Section 3 in our research article.

3 PAPERS • NO BENCHMARKS YET

BnB

BnB is a large-scale and diverse in-domain VLN (Vision and Language Navigation) dataset.

3 PAPERS • NO BENCHMARKS YET