Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi'kmaq Language Modelling

LREC 2020 · Jeremie Boudreau, Akankshya Patra, Ashima Suvarna, Paul Cook

Mi'kmaq is an Indigenous language spoken primarily in Eastern Canada. It is polysynthetic and low-resource. In this paper we consider a range of n-gram and RNN language models for Mi'kmaq. We find that an RNN language model initialized with pre-trained fastText embeddings performs best, highlighting the importance of sub-word information for Mi'kmaq language modelling. We further consider approaches to language modelling that incorporate cross-lingual word embeddings, but do not see improvements with these models. Finally, we consider language models that operate over segmentations produced by SentencePiece, which include sub-word units as tokens, as opposed to word-level models. We see improvements for this approach over word-level language models, again indicating that sub-word modelling is important for Mi'kmaq language modelling.
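
To make "sub-word information" concrete, the sketch below shows the kind of character n-gram decomposition that fastText-style embeddings rely on: a word is wrapped in boundary markers and split into overlapping character n-grams, so that related word forms share units. This is an illustration only, not the paper's code. The function name `char_ngrams` is hypothetical, the 3 to 6 range mirrors fastText's usual defaults, and the example uses English placeholder words rather than Mi'kmaq data.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, in the style used by fastText.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Boundary markers let the model tell prefixes and suffixes apart
    # from word-internal character sequences.
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# Two related word forms share many n-grams, which is how sub-word
# models can generalize to rare or unseen inflected forms.
shared = set(char_ngrams("walked")) & set(char_ngrams("walking"))
print(sorted(shared))
```

In a polysynthetic language, where a single word can carry many morphemes, most full word forms are rare, so sharing parameters across overlapping character sequences in this way is one plausible reason sub-word models help.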
