Objective: This article summarizes the preparation, organization, evaluation, and results of Track 2 of the 2018 National NLP Clinical Challenges shared task. Track 2 focused on extraction of adverse drug events (ADEs) from clinical records and evaluated 3 tasks: concept extraction, relation classification, and end-to-end systems. We perform an analysis of the results to identify the state of the art in these tasks, learn from it, and build on it.
7 PAPERS • NO BENCHMARKS YET
GeoWebNews provides test/train examples and enables fine-grained Geotagging and Toponym Resolution (Geocoding). This dataset is also suitable for prototyping and evaluating machine learning NLP models.
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts: (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences), and (3) the test set (200 dialogues, 3K sentences).
6 PAPERS • NO BENCHMARKS YET
Phenotype-Gene Relations (PGR) is a corpus that consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations.
6 PAPERS • 1 BENCHMARK
The training and development dataset for our task was taken from previous work on a wet lab corpus (Kulkarni et al., 2018) that consists of 623 protocols. We excluded the eight duplicate protocols from this dataset and then re-annotated the 615 unique protocols in BRAT (Stenetorp et al., 2012).
6 PAPERS • 2 BENCHMARKS
legal_NER is a corpus of 46545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.
5 PAPERS • NO BENCHMARKS YET
Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme.
5 PAPERS • 5 BENCHMARKS
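As a rough illustration of how CoNLL-style token-level annotations are typically consumed, the sketch below decodes a tag sequence into entity spans. It assumes BIO-prefixed tags (`B-`, `I-`, `O`) and the type labels `PER` and `LOC`; the function name `bio_to_spans` and the example sentence are hypothetical and not taken from any dataset listed here.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # We stay inside the open span only on an "I-" tag of the same type.
        inside = tag.startswith("I-") and tag[2:] == current
        if not inside:
            if current is not None:
                spans.append((current, start, i))
                start, current = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # Lenient choice (an assumption): a stray "I-" opens a new span.
                start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical example sentence and tags, for illustration only.
tokens = ["Anna", "Hansen", "bor", "i", "København", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Spans are returned with exclusive end indices so that `tokens[start:end]` recovers the entity text directly.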
Europeana Newspapers consists of four datasets with 100 pages each for the languages Dutch, French, and German (including Austrian), created as part of the Europeana Newspapers project. It is expected to contribute to the further development and improvement of named entity recognition systems with a focus on historical content.
Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. To increase the taxonomic diversity of mentions in the corpus, the 800 abstracts were collected by selecting 100 abstracts from each of the following 8 categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology and zoology. Species-800 has been annotated with a focus on the species level; however, higher taxa mentions (such as genera, families and orders) have also been considered.
5 PAPERS • 1 BENCHMARK
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles
4 PAPERS • 2 BENCHMARKS
DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.
4 PAPERS • NO BENCHMARKS YET
Finnish News Corpus for Named Entity Recognition (Finer) is a corpus that consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
4 PAPERS • 3 BENCHMARKS
Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted for named entity recognition.
Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci - Software Mentions in Science - a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usag
This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.
4 PAPERS • 1 BENCHMARK
FUNSD-r and CORD-r, introduced in Token Path Prediction, are revised VrD-NER datasets that reflect the real-world scenarios of NER on scanned VrDs.
3 PAPERS • 1 BENCHMARK
COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.
3 PAPERS • NO BENCHMARKS YET
DR.BENCH is a dataset for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
E-NER is a publicly available legal Named Entity Recognition (NER) data set. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand annotated.
The first NER dataset in the traffic domain, intended for extracting the characteristics and attributes of vehicles on the road.
LINNAEUS is a general-purpose dictionary matching software, capable of processing multiple types of document formats in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.). It can produce multiple types of output (XML, HTML, tab-separated-value file, or save to a database). It also contains methods for acting as a server (including load balancing across several servers), allowing clients to request matching over a network. A package with files for recognizing and identifying species names is available for LINNAEUS, showing 94% recall and 97% precision compared to LINNAEUS-species-corpus.
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as location (where a link could be established).
Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. We publish the NEMO corpus in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO^2)" [1], where we use it in extensive experiments and analyses, showing the importance of morphological boundaries for neural modeling of NER in morphologically rich languages. Code for these models and experiments can be found in the NEMO code repo.
3 PAPERS • 3 BENCHMARKS
Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.
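The general idea of projecting entity annotations across a parallel sentence pair can be sketched as follows. This is a generic illustration, not the procedure used to build Naamapadam: it assumes a word alignment between source and target tokens is already available, and the function name `project_entities`, the transliterated example sentence, and the alignment are all hypothetical.

```python
from typing import Dict, List, Tuple


def project_entities(
    src_spans: List[Tuple[str, int, int]],
    alignment: Dict[int, List[int]],
    tgt_len: int,
) -> List[str]:
    """Project (type, start, end) source spans onto target-side BIO tags."""
    tgt_tags = ["O"] * tgt_len
    for etype, start, end in src_spans:
        # Collect every target position aligned to a source token in the span.
        positions = sorted(
            {j for i in range(start, end) for j in alignment.get(i, [])}
        )
        if not positions:
            continue  # no aligned target tokens, so the entity is dropped
        # Assumption: label the contiguous range covering the aligned positions.
        first, last = positions[0], positions[-1]
        tgt_tags[first] = f"B-{etype}"
        for j in range(first + 1, last + 1):
            tgt_tags[j] = f"I-{etype}"
    return tgt_tags


# Hypothetical example: English source, transliterated target, manual alignment.
src_tokens = ["Sachin", "lives", "in", "Mumbai"]
tgt_tokens = ["Sachin", "Mumbai", "mein", "rehta", "hai"]
src_spans = [("PER", 0, 1), ("LOC", 3, 4)]
alignment = {0: [0], 1: [3, 4], 2: [2], 3: [1]}
print(list(zip(tgt_tokens, project_entities(src_spans, alignment, len(tgt_tokens)))))
```

Labeling the full range between the first and last aligned positions keeps projected entities contiguous, at the cost of occasionally including unaligned tokens; other designs filter or split such spans instead.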
An open, broad-coverage corpus for informal Persian named entity recognition was collected from Twitter.
PhoNER_COVID19 is a dataset for recognising COVID-19 related named entities in Vietnamese, consisting of 35K entities over 10K sentences. The authors defined 10 entity types with the aim of extracting key information related to COVID-19 patients, which is especially useful in downstream applications. In general, these entity types can be used not only in the context of the COVID-19 pandemic but also in future epidemics.
3 PAPERS • 2 BENCHMARKS
A vast amount of information in the biomedical domain is available as natural language free text. An increasing number of documents in the field are written in languages other than English. Therefore, it is essential to develop resources, methods and tools that address Natural Language Processing in the variety of languages used by the biomedical community. In this paper, we report on the development of an extensive corpus of biomedical documents in French annotated at the entity and concept level. Three text genres are covered, comprising a total of 103,056 words. Ten entity categories corresponding to UMLS Semantic Groups were annotated, using automatic pre-annotations validated by trained human annotators. The pre-annotation method was found helpful for entities and achieved above 0.83 precision for all text genres. Overall, a total of 26,409 entity annotations were mapped to 5,797 unique UMLS concepts.
ViMQ is a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the online patient-doctor consultation section of www.vinmec.com, a website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed response provided by a clinical expert. The dataset contains health issues that fall into a wide range of categories including common illness, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask for information about the hospital and selected 9,000 questions for the dataset.
Full-text chemical identification and indexing in PubMed articles.
2 PAPERS • 3 BENCHMARKS
Biographical is a semi-supervised dataset for relation extraction (RE). The dataset, which is aimed at digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
2 PAPERS • NO BENCHMARKS YET
CLUENER2020 is a well-defined fine-grained dataset for named entity recognition in Chinese. CLUENER2020 contains 10 categories.
Chinese Gigaword corpus consists of 2.2M headline-document pairs of news stories covering over 284 months from two Chinese newspapers, namely the Xinhua News Agency of China (XIN) and the Central News Agency of Taiwan (CNA).
KIND is an Italian dataset for Named-Entity Recognition. It contains more than one million tokens with the annotation covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses.
MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.
PcMSP is a dataset annotated from 305 open access scientific articles for material science information extraction that simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
Data annotation: The 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv.
2 PAPERS • 1 BENCHMARK
Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic ev
The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. The automatically generated corpus is built from Wikipedia. The gold-standard set is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages.
Digital Edition: Essays from Hannah Arendt. We have created a NER dataset from the digital edition "Sechs Essays" by Hannah Arendt. It consists of 23 documents from the period 1932-1976, which are available as TEI files online (see https://hannah-arendt-edition.net/3p.html?lang=de).
1 PAPER • NO BENCHMARKS YET
BUSiness Transaction Entity Recognition dataset.
The dataset contains a total of 253,070 records, with 18 features. The features are categorized into four different types: Metadata, Primary Data, Engagement Stats, and Label. The Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains information about the title and description of the video. The "Processed" columns refer to the cleaned data after denoising, deduplication, and debiasing for further analysis. The Engagement Stats category contains data on user engagement metrics for each video. The Label category contains predefined auto labels, human annotated labels, and AI generated pseudo labels. Auto labels are labels that are automatically derived based on a review of their titles, descriptions, and thumbnails over time. Channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on