Wiki-40B

Introduced by Guo et al. in Wiki-40B: Multilingual Language Model Dataset

A multilingual language model benchmark composed of 40+ languages spanning several scripts and linguistic families. It contains around 40 billion characters and aims to accelerate research on multilingual modeling.

Source: Wiki-40B: Multilingual Language Model Dataset

License


  • Unknown
