The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
265 PAPERS • 8 BENCHMARKS
TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., per:schools_attended and org:members) or are labeled as no_relation if no defined relation holds. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing.
186 PAPERS • 2 BENCHMARKS
The FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three subsets: training set (64 relations), validation set (16 relations) and test set (20 relations).
171 PAPERS • 4 BENCHMARKS
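The relation-disjoint split described for FewRel can be illustrated with a minimal sketch. The relation IDs below are hypothetical placeholders, and the shuffle is only for illustration: it does not reproduce the official assignment of relations to subsets.

```python
import random

def split_relations(relations, n_train=64, n_val=16, n_test=20, seed=0):
    """Partition relation labels into disjoint train/validation/test groups."""
    if n_train + n_val + n_test != len(relations):
        raise ValueError("split sizes must sum to the number of relations")
    shuffled = list(relations)
    random.Random(seed).shuffle(shuffled)  # illustrative only, not the official split
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical relation identifiers standing in for the 100 relations.
all_relations = [f"R{i:03d}" for i in range(100)]
train_rels, val_rels, test_rels = split_relations(all_relations)
```

Because no relation appears in more than one subset, a model is evaluated on relation types it never saw during training, which is what makes the setting few-shot.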
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
145 PAPERS • 3 BENCHMARKS
DocRED (Document-Level Relation Extraction Dataset) is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset is human-annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence. DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document. Along with the human-annotated data, the dataset provides large-scale distantly supervised data.
144 PAPERS • 4 BENCHMARKS
The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) and the corresponding facts in the form of natural language text. The corpus contains sets with up to 7 triplets each along with one or more reference texts for each set. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.
143 PAPERS • 17 BENCHMARKS
The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.
123 PAPERS • NO BENCHMARKS YET
The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets of scientific articles, SemEval 2017 Task 10 and SemEval 2018 Task 7, by extending entity types, relation types, and relation coverage, and by adding cross-sentence relations using coreference links.
121 PAPERS • 7 BENCHMARKS
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
62 PAPERS • 9 BENCHMARKS
ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.
47 PAPERS • 5 BENCHMARKS
The DDIExtraction 2013 task relies on the DDI corpus which contains MedLine abstracts on drug-drug interactions as well as documents describing drug-drug interactions from the DrugBank database.
47 PAPERS • 3 BENCHMARKS
The Re-TACRED dataset is a significantly improved version of the TACRED dataset for relation extraction. Using new crowd-sourced labels, Re-TACRED prunes poorly annotated sentences and addresses TACRED relation definition ambiguity, ultimately correcting 23.9% of TACRED labels. This dataset contains over 91 thousand sentences spread across 40 relations. Dataset presented at AAAI 2021.
47 PAPERS • 1 BENCHMARK
QA-SRL was proposed as an open schema for semantic roles, in which the relation between an argument and a predicate is expressed as a natural-language question containing the predicate (“Where was someone educated?”) whose answer is the argument (“Princeton”). The authors collected about 19,000 question-answer pairs from 3,200 sentences.
41 PAPERS • NO BENCHMARKS YET
RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.
The Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate the natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluation metrics for every task, thus establishing fair comparison across Korean language models.
19 PAPERS • 1 BENCHMARK
2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.
18 PAPERS • 4 BENCHMARKS
JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search 2,000 abstracts were selected and hand annotated according to a small taxonomy of 48 classes based on a chemical classification. 36 terminal classes were used to annotate the GENIA corpus.
18 PAPERS • 2 BENCHMARKS
The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.
17 PAPERS • 4 BENCHMARKS
ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task.
16 PAPERS • 1 BENCHMARK
BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, BioRED labels each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
14 PAPERS • 3 BENCHMARKS
A SemEval shared task in which participants must extract definitions from free text using a term-definition pair corpus that reflects the complex reality of definitions in natural language.
14 PAPERS • NO BENCHMARKS YET
The BioCreative V CDR task corpus is manually annotated for chemicals, diseases and chemical-induced disease (CID) relations. It contains the titles and abstracts of 1,500 PubMed articles and is split into equally sized train, validation and test sets. It is common to first tune a model on the validation set and then train on the combination of the train and validation sets before evaluating on the test set. It is also common to filter negative relations with disease entities that are hypernyms of a corresponding true relation's disease entity within the same abstract (see Appendix C of this paper for details).
11 PAPERS • 2 BENCHMARKS
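The hypernym filtering mentioned for the CDR corpus can be sketched as follows. This is a minimal illustration, not the procedure from the referenced appendix: the function name, the tuple layout, the `ancestors` mapping (disease ID to the set of its hypernym IDs) and the toy identifiers are all assumptions, and "corresponding" is interpreted here as "same abstract and same chemical".

```python
def filter_hypernym_negatives(candidates, positives, ancestors):
    """Drop negative (doc, chemical, disease) candidates whose disease is a
    hypernym of the disease in a true relation with the same doc and chemical.

    candidates: iterable of (doc_id, chemical_id, disease_id) negative pairs
    positives:  set of (doc_id, chemical_id, disease_id) true relations
    ancestors:  dict mapping disease_id -> set of hypernym disease ids (assumed input)
    """
    kept = []
    for doc_id, chemical, disease in candidates:
        is_hypernym_of_positive = any(
            p_doc == doc_id
            and p_chem == chemical
            and disease in ancestors.get(p_disease, set())
            for p_doc, p_chem, p_disease in positives
        )
        if not is_hypernym_of_positive:
            kept.append((doc_id, chemical, disease))
    return kept

# Toy data with made-up identifiers.
ancestors = {"D_hepatitis": {"D_liver_disease"}}
positives = {("doc1", "C1", "D_hepatitis")}
candidates = [
    ("doc1", "C1", "D_liver_disease"),  # hypernym of a true relation's disease: dropped
    ("doc1", "C1", "D_nausea"),         # unrelated disease: kept
    ("doc2", "C1", "D_liver_disease"),  # different abstract: kept
]
filtered = filter_hypernym_negatives(candidates, positives, ancestors)
```

The intuition is that such a candidate is not a clear negative: if a chemical induces a specific disease, a statement about the broader disease category is arguably also true.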
The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the temporal relations in clinical narratives. The organizers provided the research community with a corpus of discharge summaries annotated with temporal information, to be used for the development and evaluation of temporal reasoning systems. 18 teams from around the world participated in the challenge. During the workshop, participating teams presented comprehensive reviews and analysis of their systems, and outlined future research directions suggested by the challenge contributions.
9 PAPERS • 2 BENCHMARKS
MAVEN-ERE is a dataset designed for event relation extraction tasks containing 103,193 event coreference chains, 1,216,217 temporal relations, 57,992 causal relations, and 15,841 subevent relations.
7 PAPERS • NO BENCHMARKS YET
KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web. KnowledgeNet provides text exhaustively annotated with facts, thus enabling the holistic end-to-end evaluation of knowledge base population systems as a whole, unlike previous benchmarks that are more suitable for the evaluation of individual subcomponents (e.g., entity linking, relation extraction).
6 PAPERS • NO BENCHMARKS YET
FinRED is a relation extraction dataset curated from financial news and earnings call transcripts, containing relations from the finance domain. FinRED was created by mapping Wikidata triplets using a distant supervision method.
5 PAPERS • NO BENCHMARKS YET
GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.
5 PAPERS • 1 BENCHMARK
HyperRED is a dataset for the new task of hyper-relational extraction, which extracts relation triplets together with qualifier information such as time, quantity or location. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). HyperRED contains 44k sentences with 62 relation types and 44 qualifier types.
5 PAPERS • 4 BENCHMARKS
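To make the hyper-relational format concrete, the HyperRED example above can be represented as a triplet plus a list of qualifiers. The class and field names below are illustrative assumptions and do not reflect the dataset's actual file schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Qualifier:
    label: str   # qualifier type, e.g. "End Time"
    value: str   # qualifier value, e.g. "1967"

@dataclass(frozen=True)
class HyperRelationalFact:
    head: str
    relation: str
    tail: str
    qualifiers: Tuple[Qualifier, ...] = ()

    def triplet(self):
        """Return the plain relation triplet without qualifier information."""
        return (self.head, self.relation, self.tail)

# The example from the description above.
fact = HyperRelationalFact(
    head="Leonard Parker",
    relation="Educated At",
    tail="Harvard University",
    qualifiers=(Qualifier("End Time", "1967"),),
)
```

Dropping the qualifiers via `fact.triplet()` recovers the input that a conventional relation extraction dataset would provide.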
The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert linguistic annotators. Relabeling focuses on the 5K most challenging instances in dev and test; in total, 51.2% of these are corrected. Published at ACL 2020.
CrossRE is a cross-domain benchmark for Relation Extraction (RE), which comprises six distinct text domains and includes multi-label annotations. The dataset also includes meta-data collected during annotation, such as explanations and flags of difficult instances.
4 PAPERS • NO BENCHMARKS YET
DiS-ReX is a multilingual dataset for distantly supervised (DS) relation extraction (RE). The dataset has over 1.5 million instances, spanning 4 languages (English, Spanish, German and French). The dataset has 36 positive relation types + 1 no relation (NA) class.
TimeBankPT is a corpus of Portuguese text with annotations about time. The annotation scheme used is similar to TimeML. TimeBankPT is the result of adapting the English corpus used in the first TempEval challenge to the Portuguese language.
4 PAPERS • 1 BENCHMARK
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
2 PAPERS • NO BENCHMARKS YET
The Focused Open Biology Information Extraction (FOBIE) dataset aims to support IE from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.
MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.
NYT-H is a dataset for distantly-supervised relation extraction, in which the training data is DS-labelled and the test data is labelled by several hired annotators. NYT-H can serve as a benchmark for distantly-supervised relation extraction.
X-WikiRE is a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations.
1 PAPER • NO BENCHMARKS YET
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
This is the dataset used for classifying Gene-Disease relationship types from sentences. The dataset consists of 3 files:
1 PAPER • 1 BENCHMARK
DiaKG is a high-quality Chinese dataset for Diabetes knowledge graph.
This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .
1 PAPER • 2 BENCHMARKS
Medical Case Report Corpus is a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library.
Multi-CrossRE is the broadest multilingual dataset for Relation Extraction (RE), including 26 languages in addition to English and covering six text domains. It is a machine-translated version of CrossRE, with a sub-portion including more than 200 sentences in seven diverse languages checked by native speakers.
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).
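The tag-structure validation described for MultiTACRED might look roughly like the sketch below. The tag names `<H>` and `<T>` for the head and tail entities are assumptions made for illustration, as is the exact set of checks; the dataset's own pipeline may differ.

```python
import re

# Hypothetical markup: <H>...</H> marks the head entity, <T>...</T> the tail.
TAG_PATTERN = re.compile(r"<(/?)(H|T)>")

def has_valid_entity_tags(text):
    """Return True if the text has exactly one well-formed, non-empty
    <H>...</H> pair and one well-formed, non-empty <T>...</T> pair."""
    stack = []    # currently open tags as (name, content_start) pairs
    seen = set()  # tag names that have been opened so far
    for match in TAG_PATTERN.finditer(text):
        is_close = match.group(1) == "/"
        name = match.group(2)
        if not is_close:
            if name in seen:
                return False  # duplicate opening tag
            seen.add(name)
            stack.append((name, match.end()))
        else:
            if not stack or stack[-1][0] != name:
                return False  # closing tag without a matching open tag
            _, content_start = stack.pop()
            if not text[content_start:match.start()].strip():
                return False  # empty entity span
    return not stack and seen == {"H", "T"}

translations = [
    "<H>Angela Merkel</H> wurde in <T>Hamburg</T> geboren.",  # valid
    "<H>Angela Merkel wurde in <T>Hamburg</T> geboren.",      # missing </H>
    "Angela Merkel wurde in <T>Hamburg</T> geboren.",         # no head tag pair
]
kept = [t for t in translations if has_valid_entity_tags(t)]
```

This sketch accepts one entity nested inside the other; a stricter check could reject nesting if the annotation scheme forbids it.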
The Part-Whole Relations dataset is a dataset of semantic relations between entities. It contains the following subtypes: Component-Of, Member-Of, Portion-Of, Stuff-Of, Located-In, Contained-In, Phase-Of, and Participates-In.
The corpus contains review sentences mostly of products in electronics domain, annotated and segregated into 4 comparison categories. Each comparison sentence is annotated with names of the products (PROD1 and PROD2), the aspect (ASP) and the predicate (PRED). Dataset contains sentences after auto-labeling on SNAP dataset and manually labeled sentences from the following corpora: