Diacritics Restoration using BERT with Analysis on Czech language

24 May 2021 · Jakub Náplava, Milan Straka, Jana Straková

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are not actually errors, but either plausible variants (19%) or system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
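
To make the task concrete: a restoration system receives text whose diacritics have been removed and must recover the original form. The sketch below shows one common way to produce such stripped input. It is an illustration only, not necessarily the preprocessing used by the authors, and the function name strip_diacritics is hypothetical. It relies on Unicode decomposition, so letters without a canonical decomposition (for example Polish ł or Vietnamese đ) pass through unchanged and would need an explicit mapping.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text (hypothetical helper, illustration only)."""
    # NFD splits a precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark; base letters and other characters are kept.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Příliš žluťoučký kůň"))
```

A restoration model is then trained to map the stripped string back to the original one.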

Dataset: Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

Task                            Model  Metric Name          Metric Value  Global Rank
Czech Text Diacritization       BERT   Alpha-Word accuracy  99.22         #1
Romanian Text Diacritization    BERT   Alpha-Word accuracy  98.64         #1
Turkish Text Diacritization     BERT   Alpha-Word accuracy  98.95         #1
Hungarian Text Diacritization   BERT   Alpha-Word accuracy  99.41         #1
Croatian Text Diacritization    BERT   Alpha-Word accuracy  99.73         #1
Spanish Text Diacritization     BERT   Alpha-Word accuracy  99.62         #1
Irish Text Diacritization       BERT   Alpha-Word accuracy  98.88         #1
French Text Diacritization      BERT   Alpha-Word accuracy  99.71         #1
Slovak Text Diacritization      BERT   Alpha-Word accuracy  99.32         #1
Polish Text Diacritization      BERT   Alpha-Word accuracy  99.66         #1
Latvian Text Diacritization     BERT   Alpha-Word accuracy  98.63         #1
Vietnamese Text Diacritization  BERT   Alpha-Word accuracy  98.53         #1
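
The metric in the table can be sketched as follows. This assumes that alpha-word accuracy is the share of correctly restored words, counted only over words that contain at least one alphabetic character, so punctuation and purely numeric tokens are ignored. The whitespace tokenization and the function name alpha_word_accuracy are assumptions made for illustration; the official evaluation script may differ in detail.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Fraction of alphabetic words restored exactly (illustrative sketch)."""
    # Assumption: tokens are separated by whitespace and align one-to-one.
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted must have the same number of tokens")

    # Keep only positions whose gold word has at least one alphabetic character.
    pairs = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


if __name__ == "__main__":
    gold = "Příliš žluťoučký kůň , 42 úpěl"
    pred = "Příliš zlutoucky kůň , 42 úpěl"
    # Multiply by 100 to get a percentage comparable to the table values.
    print(100 * alpha_word_accuracy(gold, pred))
```

In this example the tokens "," and "42" are excluded, so the score is computed over four words, of which three match.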

Methods