Impact of Corpora Quality on Neural Machine Translation

19 Oct 2018 • Matīss Rikters

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in data and how such data affects neural machine translation systems, as well as how to identify and deal with them...

| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Machine Translation | WMT 2017 English-Latvian | Transformer trained on highly filtered data | BLEU | 22.89 | # 1 |
| Machine Translation | WMT 2017 Latvian-English | Transformer trained on highly filtered data | BLEU | 24.37 | # 1 |
| Machine Translation | WMT 2018 English-Finnish | Transformer trained on highly filtered data | BLEU | 17.40 | # 1 |
| Machine Translation | WMT 2018 Finnish-English | Transformer trained on highly filtered data | BLEU | 24.00 | # 2 |
