Multilingual Document Classification Corpus (MLDoc) is a cross-lingual document classification dataset covering English, German, French, Spanish, Italian, Russian, Japanese and Chinese. It is a subset of the Reuters Corpus Volume 2 selected according to the following design choices:
51 PAPERS • 11 BENCHMARKS
MultiBooked is a dataset for supervised aspect-level sentiment analysis in Basque and Catalan, both of which are under-resourced languages.
8 PAPERS • NO BENCHMARKS YET