The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical reports records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned to 7.6 codes, on average.
357 PAPERS • 6 BENCHMARKS
The RCV1 dataset is a benchmark dataset on text categorization. It is a collection of newswire articles producd by Reuters in 1996-1997. It contains 804,414 manually labeled newswire documents, and categorized with respect to three controlled vocabularies: industries, topics and regions.
234 PAPERS • 4 BENCHMARKS
The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.
37 PAPERS • 3 BENCHMARKS
The Slashdot dataset is a relational dataset obtained from Slashdot. Slashdot is a technology-related news website know for its specific user community. The website features user-submitted and editor-evaluated current primarily technology oriented news. In 2002 Slashdot introduced the Slashdot Zoo feature which allows users to tag each other as friends or foes. The network cotains friend/foe links between the users of Slashdot. The network was obtained in February 2009.
36 PAPERS • 2 BENCHMARKS
EURLEX57K is a new publicly available legal LMTC dataset, dubbed EURLEX57K, containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ∼4.3k labels (concepts) from the European Vocabulary (EUROVOC).
6 PAPERS • NO BENCHMARKS YET
This data is for the Mis2-KDD 2021 under review paper: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People’s Republic of China
1 PAPER • 1 BENCHMARK