The Sunnybrook Cardiac Data (SCD), also known as the 2009 Cardiac MR Left Ventricle Segmentation Challenge data, consist of 45 cine-MRI images from a mix of patients and pathologies: healthy, hypertrophy, heart failure with infarction and heart failure without infarction. A subset of this data set was first used in the automated myocardium segmentation challenge from short-axis MRI, held by a MICCAI workshop in 2009. The complete data set is now available in the CAP database under a public domain license.
5 PAPERS • NO BENCHMARKS YET
TRANCE extends CLEVR by asking a uniform question, i.e. what is the transformation between two given images, to test the ability of transformation reasoning. TRANCE includes three levels of settings, i.e. Basic (single-step transformation), Event (multi-step transformation), and View (multi-step transformation with variant views). Detailed information can be found at https://hongxin2019.github.io/TVR.
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset MMEA-UMVM.
5 PAPERS • 7 BENCHMARKS
VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT is a high-quality and multi-modal benchmark based on VideoCube-Tiny to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content.
Enlarge the dataset to understand how image background affects the Computer Vision ML model, with the following topics: Blur Background / Segmented Background / AI-generated Background / Bias of tools during annotation / Color in Background / Dependent Factor in Background / Latent Space Distance of Foreground / Random Background with Real Environment.
5 PAPERS • 1 BENCHMARK
The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot problem. First, we obtain the original train/test split of the F-VQA dataset and combine them to filter out the triples whose answers appear in the top-500 by occurrence frequency. Next, we randomly divide this set of answers into a new training split (a.k.a. seen) $\mathcal{A}_s$ and testing split (a.k.a. unseen) $\mathcal{A}_u$ at a ratio of 1:1. Following the standard F-VQA dataset, the division process is repeated 5 times. Each $(i,q,a)$ triplet in the original F-VQA dataset is assigned to the training set if $a \in \mathcal{A}_s$, and to the testing set otherwise. The overlap of answer instances between the training and testing sets is $2565$ in F-VQA, compared to $0$ in ZS-F-VQA.
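The split procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `zero_shot_split`, the toy triples, and the reading of the filtering step as "keep only triples whose answer is among the top-k most frequent" are all assumptions made for the example.

```python
import random
from collections import Counter


def zero_shot_split(triples, top_k=500, seed=0):
    """Split (image, question, answer) triples so that train and test share no answers.

    Hypothetical helper: keeps triples whose answer is among the top_k most
    frequent answers, then divides those answers 1:1 into seen and unseen sets.
    """
    counts = Counter(answer for _, _, answer in triples)
    top_answers = [answer for answer, _ in counts.most_common(top_k)]

    # Randomly divide the retained answers into two halves (seen / unseen).
    rng = random.Random(seed)
    rng.shuffle(top_answers)
    half = len(top_answers) // 2
    seen = set(top_answers[:half])
    unseen = set(top_answers[half:])

    # A triple goes to training if its answer is seen, to testing if unseen;
    # triples with answers outside the top_k are dropped.
    train = [t for t in triples if t[2] in seen]
    test = [t for t in triples if t[2] in unseen]
    return train, test, seen, unseen


# Toy data standing in for the combined F-VQA train/test triples.
toy = [
    ("img1", "q1", "cat"), ("img2", "q2", "cat"), ("img3", "q3", "dog"),
    ("img4", "q4", "dog"), ("img5", "q5", "car"), ("img6", "q6", "tree"),
    ("img7", "q7", "rare"),
]
train, test, seen, unseen = zero_shot_split(toy, top_k=4, seed=0)
overlap = {a for _, _, a in train} & {a for _, _, a in test}
print(len(train), len(test), overlap)
```

Repeating the call with five different `seed` values would mirror the five repetitions mentioned in the description; by construction the answer overlap between the two sets is empty.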
The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.
4 PAPERS • NO BENCHMARKS YET
AutoChart is a dataset for chart-to-text generation, a task that consists of generating analytical descriptions of visual plots.
Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6,000 documents or >25,000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset, a part of the Bentham manuscripts.
4 PAPERS • 1 BENCHMARK
CHOCOLATE is a benchmark for detecting and correcting factual inconsistency in generated chart captions. It consists of captions produced by six advanced models, which are categorized into three subsets:
4 PAPERS • 4 BENCHMARKS
CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, according to whether $x$ is even or odd. The dataset defines the background color as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.
ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation environments with the presence of environmental non-targeted variations, such as air turbidity and light condition changes, as well as targeted object changes in industrial indoor environments. By collecting data in simulations, multi-modal sensor data and precise ground truth labels are obtainable such as the RGB image, depth image, semantic segmentation, change segmentation, camera poses, and 3D reconstructions. While the previous online SCD datasets evaluate models given well-aligned image pairs, ChangeSim also provides raw unpaired sequences that present an opportunity to develop an online SCD model in an end-to-end manner, considering both pairing and detection. Experiments show that even the latest pair-based SCD models suffer from the bottleneck of the pairing process, and it gets worse when the environment contains the non-targeted variations.
4 PAPERS • 2 BENCHMARKS
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.
The CropAndWeed dataset is focused on the fine-grained identification of 74 relevant crop and weed species with a strong emphasis on data variability. Annotations of labeled bounding boxes, semantic masks and stem positions are provided for about 112k instances in more than 8k high-resolution images of both real-world agricultural sites and specifically cultivated outdoor plots of rare weed types. Additionally, each sample is enriched with meta-annotations regarding environmental conditions.
The classification and recognition of foliar diseases is an increasingly developing field of research, where the concepts of machine and deep learning are used to support agricultural stakeholders. Datasets are the fuel for the development of these technologies. In this paper, we release and make publicly available the field dataset collected to diagnose and monitor plant symptoms, called DiaMOS Plant, consisting of 3505 images of pear fruit and leaves affected by four diseases. In addition, we perform a comparative analysis of existing literature datasets designed for the classification and recognition of leaf diseases, highlighting the main features that maximize the value and information content of the collected data. This study provides guidelines that will be useful to the research community in the context of the selection and construction of datasets.
DocCVQA is a Document Visual Question Answering dataset, where the questions are posed over a whole collection of 14,362 scanned documents. Therefore, the task can be seen as a retrieval-style evidence seeking task where given a question, the aim is to identify and retrieve all the documents in a large document collection that are relevant to answering this question as well as provide the answer.
The ETHEC dataset includes 47,978 butterfly images with a 4-level label hierarchy: family, sub-family, genus and species (6 families -> 21 sub-families -> 135 genera -> 561 species).
The Few-Shot Object Learning (FewSOL) dataset can be used for object recognition with a few images per object. It contains 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. The FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition.
InsPLAD is a dataset for power line asset inspection containing 10,607 high-resolution Unmanned Aerial Vehicle (UAV) colour images. It contains 17 unique power line assets captured from real-world operating power lines. Some of those assets (five, to be precise) are also annotated regarding their conditions. They present the following defects: corrosion (4 of them), broken/missing cap (1 of them), and bird's nest presence (1 of them).
The Kvasir-SEG dataset includes 196 polyps smaller than 10 mm classified as Paris class 1 sessile or Paris class IIa. We have selected it with the help of expert gastroenterologists. We have released this dataset separately as a subset of Kvasir-SEG. We call this subset Kvasir-Sessile.
The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset, comprising 11276 images from 564 patients and 1043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).
The MNIST Large Scale dataset is based on the classic MNIST dataset, but contains large scale variations up to a factor of 16. The motivation behind creating this dataset was to enable testing the ability of different algorithms to learn in the presence of large scale variability and specifically the ability to generalise to new scales not present in the training set over wide scale ranges.
The dataset is aimed at Visual Question Answering on multipage scanned industry documents. The questions and answers are reused from the Single Page DocVQA (SP-DocVQA) dataset. The images also correspond to those in the original dataset, extended with the preceding and following pages, up to a limit of 20 pages per document.
A large-scale dataset that consists of 21,184 claims, where each claim is assigned a truthfulness label and ruling statement, with 58,523 pieces of evidence in the form of text and images. It supports end-to-end multimodal fact-checking and explanation generation, where the input is a claim and a large collection of web sources, including articles, images, videos, and tweets, and the goal is to assess the truthfulness of the claim by retrieving relevant evidence and predicting a truthfulness label (i.e., support, refute and not enough information), and to generate a rationalization statement that explains the reasoning and ruling process.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
For each dataset we provide a short description as well as some characterization metrics. These include the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep) and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels: the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the used partition methods are described in
MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. MultiSubs will benefit research on visual grounding of words, especially in the context of free-form sentences.
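As a rough illustration of the simpler metrics defined above (cardinality, density, diversity and complexity), the sketch below computes them for a tiny binary label matrix. The function name `label_stats` and the toy data are hypothetical, and taking the number of possible labelsets to be 2^q is an assumption; some definitions bound it by the number of instances instead.

```python
def label_stats(Y, d):
    """Compute basic multi-label characterization metrics.

    Y is a list of m binary label vectors of length q; d is the number of
    attributes. Hypothetical helper written for illustration only.
    """
    m = len(Y)
    q = len(Y[0])

    # Cardinality: average number of labels associated with each instance.
    card = sum(sum(row) for row in Y) / m
    # Density: cardinality divided by the number of labels.
    dens = card / q
    # Diversity: distinct labelsets present divided by possible labelsets
    # (assumed here to be 2**q).
    div = len({tuple(row) for row in Y}) / (2 ** q)
    # Complexity: m x q x d.
    complexity = m * q * d
    return {"Card": card, "Dens": dens, "Div": div, "Complexity": complexity}


# Toy label matrix: 4 instances, 3 labels, and an assumed 10 attributes.
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
stats = label_stats(Y, d=10)
print(stats)
```

For this toy matrix the cardinality is 7/4 = 1.75, and three distinct labelsets out of eight possible give a diversity of 0.375.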
4 PAPERS • 5 BENCHMARKS
PGDP5K is a dataset consisting of 5000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types and 6 text types, labeled with more fine-grained annotations at the primitive level, including primitive classes, locations and relationships. Of these, 1,813 non-duplicated images are selected from the Geometry3K dataset, and the other 3,187 images are collected from three popular textbooks across grades 6-12 on mathematics curriculum websites by taking screenshots from PDF books.
Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit
The IUPUI-CSRC Pedestrian Situated Intent (PSI) benchmark dataset has two innovative labels besides comprehensive computer vision annotations. The first novel label is the dynamic intent changes for the pedestrians to cross in front of the ego-vehicle, achieved from 24 drivers with diverse backgrounds. The second one is the text-based explanations of the driver reasoning process when estimating pedestrian intents and predicting their behaviors during the interaction period.
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dat
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model.
RITE (Retinal Images vessel Tree Extraction) is a database that enables comparative studies on segmentation or classification of arteries and veins on retinal fundus images. It is established based on the publicly available DRIVE database (Digital Retinal Images for Vessel Extraction).
RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic
Sewer-ML is a sewer defect dataset. It contains 1.3 million images, from 75,618 videos collected from three Danish water utility companies over nine years. All videos have been annotated by licensed sewer inspectors following the Danish sewer inspection standard, Fotomanualen. This leads to consistent and reliable annotations, and a total of 17 annotated defect classes.
SpaceNet 1: Building Detection v1 is a dataset for building footprint detection. The data is comprised of 382,534 building footprints, covering an area of 2,544 sq. km of 3/8 band WorldView-2 imagery (0.5 m pixel res.) across the city of Rio de Janeiro, Brazil. The images are processed as 200m×200m tiles with associated building footprint vectors for training.
Games dataset containing 100,000 Gameplay Images of 175 Video Games across 10 Sports Genres - AMERICAN FOOTBALL, BASKETBALL, BIKE RACING, CAR RACING, FIGHTING, HOCKEY, SOCCER, TABLE TENNIS, TENNIS.
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms in unconstrained environments. The vehicles contained in the database, in addition to being small, exhibit different variabilities such as multiple orientations, lighting/shadowing changes, specularities or occlusions. Furthermore, each image is available in several spectral bands and resolutions. A precise experimental protocol is also given, ensuring that the experimental results obtained by different people can be properly reproduced and compared. We also give the performance of some baseline algorithms on this dataset, for different settings of these algorithms, to illustrate the difficulties of the task and provide baseline comparisons.
The VizWiz-VQA-Grounding dataset is a dataset that visually grounds answers to visual questions asked by people with visual impairments.
This dataset is collected via the WinoGAViL game to gather challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
A new dataset with significant occlusions related to object manipulation.
The Zenseact Open Dataset (ZOD) is a large-scale and diverse multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.
Subset of AdobeVFR. The dataset contains "real-world text images".
3 PAPERS • 1 BENCHMARK
Subset of AdobeVFR. The dataset contains images depicting English text and consists of 1000 synthetic images for training and 100 for testing, for each of 2383 font classes. The training and test sets are called VFR_syn_train and VFR_syn_val, respectively.
The Aircraft Context Dataset, a composition of two inter-compatible large-scale and versatile image datasets focusing on manned aircraft and UAVs, is intended for training and evaluating classification, detection and segmentation models in aerial domains. Additionally, a set of relevant meta-parameters can be used to quantify dataset variability as well as the impact of environmental conditions on model performance.
3 PAPERS • NO BENCHMARKS YET
For a detailed description, we refer to Section 3 in our research article.
BnB is a large-scale and diverse in-domain VLN (Vision and Language Navigation) dataset.