Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
13 PAPERS • 1 BENCHMARK
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:
6 PAPERS • NO BENCHMARKS YET
Description We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
1 PAPER • 2 BENCHMARKS
The dataset contains the training and test data for the SOftware Mention Detection challenge. The data is derived from the SoMeSci Knowledge Graph of software mentions.
1 PAPER • NO BENCHMARKS YET