11 dataset results for Entity Disambiguation

The CoNLL dataset is a widely used resource in natural language processing (NLP). “CoNLL” stands for the Conference on Computational Natural Language Learning, and the dataset originates from the series of shared tasks organized at that conference.

176 PAPERS • 52 BENCHMARKS

AIDA CoNLL-YAGO

AIDA CoNLL-YAGO contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase mid.
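As a rough illustration of the three identifier types mentioned above, the following minimal sketch models one mention-to-entity assignment. The class name, field names, and example values are hypothetical and chosen for illustration only; they do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EntityAssignment:
    """One annotated mention with up to three entity identifiers (hypothetical schema)."""

    mention: str
    yago2_name: Optional[str] = None      # YAGO2 entity name, if available
    wikipedia_url: Optional[str] = None   # Wikipedia URL, if available
    freebase_mid: Optional[str] = None    # Freebase mid, if available

    def identifiers(self) -> Dict[str, str]:
        """Return only the identifiers that are actually present."""
        candidates = {
            "yago2": self.yago2_name,
            "wikipedia": self.wikipedia_url,
            "freebase": self.freebase_mid,
        }
        return {key: value for key, value in candidates.items() if value is not None}

    def is_linked(self) -> bool:
        """A mention counts as linked if at least one identifier is present."""
        return bool(self.identifiers())


# Illustrative values only, not copied from the dataset.
example = EntityAssignment(
    mention="Germany",
    yago2_name="Germany",
    wikipedia_url="http://en.wikipedia.org/wiki/Germany",
)
unlinked = EntityAssignment(mention="Some Unknown Name")
```

Keeping all three identifiers optional is a design choice of this sketch: it lets a single record type cover mentions that carry only some of the identifiers, or none at all.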

63 PAPERS • 3 BENCHMARKS

ACE 2004

ACE 2004 (ACE 2004 Multilingual Training Corpus)

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations. It was created by the Linguistic Data Consortium with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

46 PAPERS • 5 BENCHMARKS

Mewsli-9

Mewsli-9 is a large multilingual dataset for entity linking.

8 PAPERS • 1 BENCHMARK

AQUAINT

The AQUAINT Corpus consists of English newswire text drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project for use in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).

6 PAPERS • 1 BENCHMARK

SoMeSci

SoMeSci (Software Mentions in Scientific Articles)

Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci - Software Mentions in Science - a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usag

4 PAPERS • NO BENCHMARKS YET

TAC 2010

TAC 2010 is a summarization dataset. Its test set consists of approximately 44 topics, each associated with a set of 10 documents. The topics are divided into five categories: Accidents and Natural Disasters, Attacks, Health and Safety, Endangered Resources, and Investigations and Trials.

4 PAPERS • 1 BENCHMARK

Hansel

Hansel is a human-annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities.

2 PAPERS • NO BENCHMARKS YET

WikiSRS

WikiSRS is a novel dataset of similarity and relatedness judgments of paired Wikipedia entities (people, places, and organizations), as assigned by Amazon Mechanical Turk workers.

2 PAPERS • NO BENCHMARKS YET

Wikidata-Disamb

The Wikidata-Disamb dataset is intended to allow a clean and scalable evaluation of named entity disambiguation (NED) with Wikidata entries, and to serve as a reference in future research.

2 PAPERS • NO BENCHMARKS YET

ShadowLink

The ShadowLink dataset is designed to evaluate the impact of entity overshadowing on the task of entity disambiguation. Paper: "Robustness Evaluation of Entity Disambiguation Using Prior Probes: the Case of Entity Overshadowing" by Vera Provatorova, Svitlana Vakulenko, Samarth Bhargav, Evangelos Kanoulas. EMNLP 2021.
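To make the notion of overshadowing concrete, here is a minimal sketch of a prior-only baseline. It assumes that overshadowing refers to a less common entity sharing its surface form with a more common one, so that a system relying on mention-entity priors always picks the more common entity. The function names and the toy counts are hypothetical and are not taken from ShadowLink or from the paper's code.

```python
from collections import Counter, defaultdict


def build_prior(linked_mentions):
    """Count how often each surface form links to each entity."""
    counts = defaultdict(Counter)
    for surface, entity in linked_mentions:
        counts[surface][entity] += 1
    return counts


def most_common_entity(prior, surface):
    """Prior-only baseline: pick the entity seen most often for this surface form."""
    if surface not in prior:
        return None
    return prior[surface].most_common(1)[0][0]


def overshadowed_entities(prior, surface):
    """Entities sharing the surface form that the prior-only baseline never predicts."""
    top = most_common_entity(prior, surface)
    return sorted(entity for entity in prior.get(surface, {}) if entity != top)


# Hypothetical toy data: one surface form, two entities with unequal frequency.
toy_links = [
    ("Michael Jordan", "Michael_Jordan"),
    ("Michael Jordan", "Michael_Jordan"),
    ("Michael Jordan", "Michael_Jordan"),
    ("Michael Jordan", "Michael_I._Jordan"),
]
toy_prior = build_prior(toy_links)
```

In this toy setting the baseline resolves every "Michael Jordan" mention to the more frequent entity regardless of context, which is the failure mode that a robustness benchmark of this kind is meant to expose.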

1 PAPER • NO BENCHMARKS YET
