WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
54 PAPERS • 7 BENCHMARKS
Peyma is a Persian NER dataset for training and testing NER systems. It was constructed by collecting documents from ten news websites.
8 PAPERS • NO BENCHMARKS YET
HengamCorpus is a Persian corpus annotated with temporal tags in the BIO tagging scheme. The dataset was generated by applying HengamTagger (https://github.com/kargaranamir/parstdex) to a large number of sentences. It covers two types of Persian text: formal (Persian Wikipedia and the Hamshahri Corpus) and informal (Twitter and HelloKish). To maximize the diversity of patterns for training and evaluation, the authors drew samples uniformly from sets of sentences that share a unique “temporal pattern profile”, that is, a presence/absence vector of the different temporal patterns within the sentence.
1 PAPER • 1 BENCHMARK
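The profile-based sampling described for HengamCorpus can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the two regex detectors, the English toy sentences, and the function names `pattern_profile` and `sample_by_profile` are all hypothetical, and the sketch takes one possible reading of "uniform" (pick a profile uniformly at random, then pick a sentence from that profile's group, with replacement).

```python
import random
import re
from collections import defaultdict

# Hypothetical stand-ins for temporal pattern detectors (not HengamTagger's).
DETECTORS = [
    re.compile(r"\b\d{4}\b").search,          # a four-digit year
    re.compile(r"\b\d{1,2}:\d{2}\b").search,  # a clock time such as 10:30
]


def pattern_profile(sentence, detectors=DETECTORS):
    """Return the presence/absence vector: one 0/1 entry per pattern."""
    return tuple(int(detect(sentence) is not None) for detect in detectors)


def sample_by_profile(sentences, n_samples, detectors=DETECTORS, seed=0):
    """Draw sentences so that every distinct profile is equally likely."""
    rng = random.Random(seed)

    # Group sentences by their temporal pattern profile.
    groups = defaultdict(list)
    for sentence in sentences:
        groups[pattern_profile(sentence, detectors)].append(sentence)

    # Choose a profile uniformly, then a sentence within it (with replacement).
    profiles = sorted(groups)
    return [rng.choice(groups[rng.choice(profiles)]) for _ in range(n_samples)]


sentences = [
    "She was born in 1990.",
    "He moved there in 2005.",
    "The train leaves at 10:30.",
    "The meeting in 2021 started at 9:15.",
    "No temporal expression here.",
]
samples = sample_by_profile(sentences, n_samples=4)
```

Sampling over profiles instead of raw sentences keeps rare pattern combinations from being swamped by the most common ones, which is the diversity goal the description mentions.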