12 dataset results for NER AND English

CoNLL 2003

CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each language there is a training file, a development file, a test file, and a large file of unannotated data.

638 PAPERS • 16 BENCHMARKS

Adverse Drug Events (ADE) Corpus

A benchmark corpus developed to support the automatic extraction of drug-related adverse effects from medical case reports.

13 PAPERS • 3 BENCHMARKS

BC4CHEMD

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles

4 PAPERS • 1 BENCHMARK

FiNER-139

FiNER-139 comprises 1.1M sentences annotated with eXtensible Business Reporting Language (XBRL) tags extracted from annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities of a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of 139 entity types. Another important difference from typical entity extraction is that FiNER focuses on numeric tokens, with the correct tag depending mostly on context, not the token itself.

3 PAPERS • NO BENCHMARKS YET

Biographical

Biographical (Biographical: A Semi-Supervised Relation Extraction Dataset)

Biographical is a semi-supervised dataset for relation extraction (RE). The dataset, which is aimed at digital humanities (DH) and historical research, is compiled automatically by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.

2 PAPERS • NO BENCHMARKS YET

AISECKG

AISECKG (AISecKG: Knowledge Graph Dataset for Cybersecurity Education)

Cybersecurity education is exceptionally challenging because it involves learning complex attacks and tools and developing the critical problem-solving skills needed to defend systems. For a student or novice researcher in the cybersecurity domain, there is a need to design an adaptive learning strategy that can break complex tasks and concepts into simple representations. An AI-enabled automated cybersecurity education system can improve cognitive engagement and active learning. Knowledge graphs (KGs) provide a visual, graph-based representation that supports reasoning over and interpretation of the underlying data, making them suitable for use in education and interactive learning. However, there are no publicly available datasets for the cybersecurity education domain with which to build such systems. The data is present as unstructured educational course material, Wiki pages, capture the flag (CTF) writeups, etc. Creating knowledge graphs from unstructured text is challenging without an ontology or annotated dataset. …

1 PAPER • NO BENCHMARKS YET

IECSIL FIRE-2018 Shared Task

The dataset is taken from the first shared task on Information Extractor for Conversational Systems in Indian Languages (IECSIL). It consists of 1,548,570 Hindi words in Devanagari script with corresponding NER labels. The end of each sentence is marked by a "newline" tag. Fig. 1 shows a snapshot of one sentence in the dataset. The dataset has nine classes: Datenum, Event, Location, Name, Number, Occupation, Organization, Other, and Things.
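As a rough illustration of the layout described above, the sketch below groups word/label pairs into sentences using the sentence-end tag. The file layout is an assumption, not something the listing confirms: one tab-separated `word<TAB>label` pair per line, with the literal token `newline` as the end marker. The function name `read_sentences` and the sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str], end_marker: str = "newline") -> List[Sentence]:
    """Group (word, label) pairs into sentences.

    Assumes one tab-separated "word<TAB>label" pair per line and a
    sentence-end marker token (hypothetical layout).
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        token = parts[0]
        if token == end_marker:
            # The marker closes the current sentence; it is not kept as a token.
            if current:
                sentences.append(current)
                current = []
            continue
        # Fall back to "Other" when a line carries no label column.
        label = parts[1] if len(parts) > 1 else "Other"
        current.append((token, label))
    if current:
        sentences.append(current)  # flush a final sentence with no marker
    return sentences


# Made-up sample lines, for illustration only.
sample = [
    "दिल्ली\tLocation",
    "में\tOther",
    "newline\tnewline",
    "राम\tName",
    "newline\tnewline",
]
parsed = read_sentences(sample)
```

If the real files use a different separator or marker spelling, only the `split` call and the `end_marker` argument should need to change.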

1 PAPER • 1 BENCHMARK

NuNER

The dataset used to pre-train NuNER, from the paper "NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data".

1 PAPER • NO BENCHMARKS YET

TASTEset

TASTEset Recipe Dataset and Food Entities Recognition is a dataset for Named Entity Recognition (NER) which consists of 700 recipes with more than 13,000 entities to extract.

1 PAPER • NO BENCHMARKS YET

The EMBO SourceData-NLP dataset

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubMedBERT, two transformer-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.

1 PAPER • 1 BENCHMARK

Noise-SF

Based on RADDLE and SNIPS, we construct Noise-SF, which includes two different perturbation settings. For the single-perturbation setting, we include five types of noisy utterances from RADDLE (character-level: Typos; word-level: Speech; sentence-level: Simplification, Verbose, and Paraphrase). For the mixed-perturbation setting, we use TextFlint to introduce a character-level perturbation (EntTypos), a word-level perturbation (Subword), and a sentence-level perturbation (AppendIrr), and combine them to obtain a mixed-perturbation dataset.
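To make the idea of combining perturbations at different levels concrete, here is a toy sketch that applies a character-level typo and then a sentence-level irrelevant suffix to one utterance. It is not TextFlint's API and not the authors' pipeline; the function names, the swap rule, and the suffix text are all hypothetical, and the word-level step is left out for brevity.

```python
def swap_typo(text: str, index: int) -> str:
    """Character-level noise: swap the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text  # nothing to swap at this position
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def append_irrelevant(text: str, suffix: str = "by the way thanks") -> str:
    """Sentence-level noise: append a clause unrelated to the request."""
    return f"{text} {suffix}"


def mixed_perturbation(text: str, index: int) -> str:
    """Apply the character-level step, then the sentence-level step."""
    return append_irrelevant(swap_typo(text, index))


noisy = mixed_perturbation("play jazz", 0)
# noisy == "lpay jazz by the way thanks"
```

In a slot-filling setting the slot labels would also need to be realigned after each step, which this sketch does not attempt.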

0 PAPER • NO BENCHMARKS YET

SourceData-NLP

SourceData-NLP (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

Introduction: The scientific publishing landscape is expanding rapidly, making it challenging for researchers to stay up to date with the evolution of the literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast body of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. …

0 PAPER • NO BENCHMARKS YET
