The English Penn Treebank (PTB) corpus, and in particular its Wall Street Journal (WSJ) portion, is one of the best-known and most widely used corpora for evaluating sequence labelling models. The task consists of annotating each word with its part-of-speech tag. In the most common split of this corpus, sections 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections 19 to 21 for validation (5 527 sentences, 131 768 tokens), and sections 22 to 24 for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level language modelling.
978 PAPERS • 10 BENCHMARKS
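The PTB split described above can be sketched as a small lookup. This is only an illustration: the function name `wsj_split` and the `SPLIT_SIZES` table are hypothetical helpers, not part of any PTB distribution or library, and the counts are copied from the description.

```python
def wsj_split(section: int) -> str:
    """Map a WSJ section number (0-24) to its split in the common POS-tagging setup."""
    if not 0 <= section <= 24:
        raise ValueError(f"WSJ sections run from 0 to 24, got {section}")
    if section <= 18:
        return "train"
    if section <= 21:
        return "validation"
    return "test"


# (sentences, tokens) per split, as quoted in the description above.
SPLIT_SIZES = {
    "train": (38219, 912344),
    "validation": (5527, 131768),
    "test": (5462, 129654),
}

if __name__ == "__main__":
    for name, (sentences, tokens) in SPLIT_SIZES.items():
        # Average sentence length is derived purely from the quoted counts.
        print(f"{name}: {sentences} sentences, {tokens / sentences:.1f} tokens per sentence")
```

For example, `wsj_split(20)` returns `"validation"`, since section 20 falls in the 19 to 21 range.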
The Numeric Fused-Head dataset consists of ~10K crowd-sourced examples, each labeled with one of 7 categories that fall into two types. In the first type, Reference, the missing head is referenced explicitly elsewhere in the discourse, either in the same sentence or in surrounding sentences. In the second type, Implicit, the missing head does not appear in the text and must be inferred by the reader or hearer from context or world knowledge. Examples of this type are labeled with one of the 6 most common categories in the dataset. Models are evaluated on accuracy.
1 PAPER • 2 BENCHMARKS
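As a rough illustration of the accuracy evaluation mentioned for the Numeric Fused-Head dataset, the sketch below computes the fraction of examples whose predicted category matches the gold one. The function name `accuracy` and the label strings are placeholders chosen for this example; they are not the dataset's official evaluation script or label names.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    # Placeholder labels: one Reference label and two hypothetical Implicit categories.
    gold = ["REFERENCE", "cat_a", "cat_b", "cat_a"]
    predicted = ["REFERENCE", "cat_a", "cat_a", "cat_a"]
    print(accuracy(gold, predicted))  # 3 of 4 correct -> 0.75
```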