The Multilingual Document Classification Corpus (MLDoc) is a cross-lingual document classification dataset covering English, German, French, Spanish, Italian, Russian, Japanese and Chinese. It is a subset of the Reuters Corpus Volume 2, selected according to the following design choices:
uniform class coverage: the same number of examples for each class and language,
official train / development / test split: for each language, training sets of different sizes (1K, 2K, 5K and 10K stories), a development set (1K) and a test set (4K) are provided (with the exception of Spanish and Russian, which have 9458 and 5216 training documents respectively).
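
The uniform class coverage described above can be illustrated with a minimal Python sketch. It is not the procedure used to build MLDoc: the function name `sample_balanced`, the `(label, text)` pair layout and the fixed seed are all assumptions made for illustration. It simply draws the same number of documents for every class from a labelled pool.

```python
import random
from collections import defaultdict


def sample_balanced(docs, per_class, seed=0):
    """Return a subset with exactly `per_class` documents for each label.

    `docs` is a list of (label, text) pairs. This layout and the function
    itself are hypothetical; they are not part of any MLDoc tooling.
    """
    # Group the documents by their class label.
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append((label, text))

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"label {label!r} has only {len(pool)} documents")
        # Draw without replacement so no document is selected twice.
        subset.extend(rng.sample(pool, per_class))
    return subset
```

Applying the same call to each language's pool with the same `per_class` value would give equal counts per class and per language, provided every class has at least `per_class` documents in every language; otherwise the sketch raises a `ValueError` instead of returning an unbalanced subset.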