The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
1 PAPER • NO BENCHMARKS YET
TASTEset Recipe Dataset and Food Entities Recognition is a dataset for Named Entity Recognition (NER) that consists of 700 recipes with more than 13,000 entities to extract.
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
1 PAPER • 31 BENCHMARKS
Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. The STEM ECR v1.0 dataset was developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises annotations for scientific entities in scientific abstracts drawn from 10 disciplines in Science, Technology, Engineering, and Medicine. The annotated entities are further grounded to Wikipedia and Wiktionary.
0 PAPER • NO BENCHMARKS YET