News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, Romanian, Russian, Turkish) and additional 1500 sentences from each of the 5 languages translated to English. For Romanian a third of the test set were released as a development set instead. For Turkish additional 500 sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models and development sets for tuning. Some training corpora were identical from WMT 2015 (Europarl, United Nations, French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, Wikipedia Headlines provided by CMU) and some were update (CzEng v1.6pre, News Commentary v11, monolingual news data). Additionally,
23 PAPERS • 8 BENCHMARKS
The MUSE dataset contains bilingual dictionaries for 110 pairs of languages. For each language pair, the training seed dictionaries contain approximately 5000 word pairs while the evaluation sets contain 1500 word pairs.
2 PAPERS • 2 BENCHMARKS
The ConcoDisco Corpus is an English-French parallel corpus with discourse relations (DRs) and discourse connectives (DCs) annotations.
1 PAPER • NO BENCHMARKS YET