MOROCO (MOldavian and ROmanian Dialectal COrpus)

Introduced by Butnaru et al. in MOROCO: The Moldavian and Romanian Dialectal Corpus

The MOldavian and ROmanian Dialectal COrpus (MOROCO) contains 33,564 text samples (over 10 million tokens) collected from the news domain. Each sample belongs to one of six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21,719 samples for training, 5,921 samples for validation and 5,924 samples for testing.
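As a quick sanity check on the split described above, the following minimal Python sketch confirms that the three reported split sizes add up to the reported total and computes each split's share. The names `SPLIT_SIZES`, `REPORTED_TOTAL` and `split_fractions` are illustrative only and are not part of any MOROCO distribution or loader.

```python
# Split sizes and total as reported in the corpus description.
# The variable and function names here are hypothetical, chosen for illustration.
SPLIT_SIZES = {"train": 21_719, "validation": 5_921, "test": 5_924}
REPORTED_TOTAL = 33_564


def split_fractions(sizes):
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

# Report whether the splits account for every sample in the corpus.
print(f"sum of splits: {total} (matches reported total: {total == REPORTED_TOTAL})")
for name, fraction in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} samples ({fraction:.1%})")
```

The three splits sum to exactly 33,564, so the partition is complete, and the proportions come out to roughly 65% training, 18% validation and 18% testing.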

Source: MOROCO: The Moldavian and Romanian Dialectal Corpus
