Construction and Analysis of a Large Vietnamese Text Corpus

LREC 2016 · Dieu-Thu Le, Uwe Quasthoff

This paper presents a new Vietnamese text corpus containing around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting and cleaning the texts and creating the corpus. Processing Vietnamese text poses several challenges. For example, unlike many languages written in the Latin script, Vietnamese does not use blanks to separate words, so common tokenizers that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented, together with a description of how the corpus was processed and created. Some statistical analysis of the data is then reported, including the number of syllables, average word length, sentence length and topic analysis. The corpus is integrated into a framework that allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, see sample sentences in which the word occurs, and view its left and right neighbors.
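To illustrate the tokenization problem the abstract describes, here is a minimal sketch in Python. It is not the tokenizer used for the corpus: it applies a simple greedy longest-match strategy against a tiny lexicon that is invented for illustration. The names `LEXICON`, `segment` and `average_syllables_per_word` are hypothetical. In written Vietnamese, blanks separate syllables, so splitting on blanks yields syllables rather than words.

```python
import unicodedata

# Hypothetical toy lexicon of multi-syllable words (an assumption for this
# sketch; a real system would use a large dictionary or a trained model).
LEXICON = {
    tuple(unicodedata.normalize("NFC", w).split())
    for w in ("sinh viên", "đại học", "Việt Nam")
}
MAX_WORD_SYLLABLES = max(len(w) for w in LEXICON)


def segment(sentence):
    """Group blank-separated syllables into words by greedy longest match."""
    syllables = unicodedata.normalize("NFC", sentence).split()
    words = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                # Join the syllables of one word with an underscore.
                words.append("_".join(candidate))
                i += n
                break
    return words


def average_syllables_per_word(words):
    """Mean number of syllables per segmented word."""
    return sum(w.count("_") + 1 for w in words) / len(words)


sentence = "tôi là sinh viên đại học"
print(sentence.split())      # naive split: six syllables, not words
words = segment(sentence)
print(words)                 # four words after segmentation
print(average_syllables_per_word(words))
```

Greedy matching is chosen here only because it is short and easy to follow. It cannot resolve ambiguous syllable sequences, which is one reason the tokenization approaches the paper reviews matter.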
