5 dataset results for Language Modelling AND Multilingual

XQuAD

XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. The dataset consists of a subset of 240 paragraphs and 1190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016) together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. Consequently, the dataset is entirely parallel across 11 languages.

169 PAPERS • 2 BENCHMARKS

OSCAR

OSCAR or Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.

55 PAPERS • NO BENCHMARKS YET

Glot500-c

Glot500-c (Glot500 Corpus)

A dataset of natural language data collected by combining more than 150 existing monolingual and multilingual datasets and by crawling known multilingual websites. The focus of this dataset is on 500 extremely low-resource languages.

1 PAPER • NO BENCHMARKS YET

GlotSparse

Collection of news websites in low-resource languages.

1 PAPER • NO BENCHMARKS YET

GlotStoryBook

Storybooks in 174 unique languages.

1 PAPER • NO BENCHMARKS YET
