WikiMatrix is a dataset of parallel sentences in the textual content of Wikipedia for all possible language pairs. The mined data consists of:
85 PAPERS • NO BENCHMARKS YET
CELEX database comprises three different searchable lexical databases, Dutch, English and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.
57 PAPERS • NO BENCHMARKS YET
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
5 PAPERS • NO BENCHMARKS YET
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
4 PAPERS • NO BENCHMARKS YET