The RVL-CDIP dataset consists of scanned document images in 16 classes, including letter, form, email, resume, and memo. It has 320,000 training, 40,000 validation, and 40,000 test images. The images are low quality, noisy, and low resolution, typically 100 dpi.
94 PAPERS • 4 BENCHMARKS
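As a quick check on the RVL-CDIP split sizes quoted above, the following minimal sketch computes the total image count and the share of each split. The variable names are illustrative only; the counts are taken from the description.

```python
# Split sizes as stated in the RVL-CDIP description.
splits = {"train": 320_000, "validation": 40_000, "test": 40_000}

# Total number of images across all splits.
total = sum(splits.values())

# Fraction of the dataset that falls into each split.
fractions = {name: count / total for name, count in splits.items()}

for name, fraction in fractions.items():
    print(f"{name}: {splits[name]} images ({fraction:.0%})")
```

The stated counts correspond to an 80/10/10 split over 400,000 images.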
The Tobacco-3482 dataset consists of 3,482 document images in 10 classes, including letter, form, email, resume, and memo.
5 PAPERS • 3 BENCHMARKS
The Sacrobosco Visual Elements Dataset (S-VED) is derived from 359 Sphaera editions, centered on the Tractatus de sphaera by Johannes de Sacrobosco (d. 1256) and printed between 1472 and 1650. The Sphaera editions were primarily used to teach geocentric astronomy to university students across Europe. Their visual elements therefore played an essential role in visualizing the ideas, messages, and concepts that the texts transmitted. As a precondition for studying the relation between text and visual elements, a time-consuming image labelling process was conducted as part of “The Sphere” project (https://sphaera.mpiwg-berlin.mpg.de) to extract and label the visual elements from the 76,000 pages of the corpus. This process resulted in the Extended Sacrobosco Visual Elements Dataset (S-VED𝑋), of which S-VED is a subset. For copyright reasons, only S-VED is publicly available. S-VED consists of 4000 pages of which 2040 contain a total of 2927 visual element
2 PAPERS • NO BENCHMARKS YET
This paper introduces SUT, a new large-scale dataset of Farsi document images. It aims to address the difficulty of obtaining diverse and substantial ground-truth data for supervised models in document image analysis (DIA) tasks such as document image classification, text detection and recognition, and information retrieval. The dataset comprises 62,453 images categorized into 21 distinct classes, including identity documents featuring synthetically generated personal information superimposed on various backgrounds. The dataset also includes corresponding files with labeling information for the images. The ground-truth data is organized in CSV files containing image file paths and associated information about the embedded data.
1 PAPER • 2 BENCHMARKS
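The SUT description says the ground truth is stored in CSV files that pair image file paths with associated information. A minimal sketch of reading such a file might look like the following. The column names `image_path` and `label`, the class names, and the sample rows are hypothetical, since the actual CSV schema is not given here.

```python
import csv
import io
from collections import Counter


def load_ground_truth(csv_file, path_column="image_path", label_column="label"):
    """Read (image path, label) pairs from an open CSV file.

    The column names are assumptions for illustration; adjust them to
    match the header of the real ground-truth files.
    """
    reader = csv.DictReader(csv_file)
    return [(row[path_column], row[label_column]) for row in reader]


# Hypothetical in-memory sample standing in for a real ground-truth file.
sample = io.StringIO(
    "image_path,label\n"
    "images/0001.jpg,id_card\n"
    "images/0002.jpg,letter\n"
    "images/0003.jpg,id_card\n"
)

records = load_ground_truth(sample)

# Count how many images fall into each class.
class_counts = Counter(label for _, label in records)
print(class_counts.most_common())
```

With a real file, the `io.StringIO` sample would be replaced by `open(path, newline="", encoding="utf-8")`, as the `csv` module documentation recommends for file input.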