9 dataset results for Machine Translation AND Czech

WMT 2016

WMT 2016 is a collection of datasets used in shared tasks of the First Conference on Machine Translation. The conference builds on ten previous Workshops on statistical Machine Translation.

170 PAPERS • 18 BENCHMARKS

Europarl

Europarl (European Parliament Proceedings Parallel Corpus)

A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.

125 PAPERS • NO BENCHMARKS YET

WMT 2020

WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.

33 PAPERS • 1 BENCHMARK

WMT 2016 News (WMT 2016 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, Romanian, Russian, Turkish) and additional 1500 sentences from each of the 5 languages translated to English. For Romanian a third of the test set were released as a development set instead. For Turkish additional 500 sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models and development sets for tuning. Some training corpora were identical from WMT 2015 (Europarl, United Nations, French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, Wikipedia Headlines provided by CMU) and some were update (CzEng v1.6pre, News Commentary v11, monolingual news data). Additionally,

23 PAPERS • 8 BENCHMARKS

WMT 2018 News (WMT 2018 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and additional 1500 sentences from each of the 7 languages translated to English. The sentences were selected from dozens of news websites and translated by professional translators.

8 PAPERS • NO BENCHMARKS YET

Demetr

Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.

6 PAPERS • NO BENCHMARKS YET

WMT 2014 Medical

WMT 2014 Medical (WMT 2014 Medical Translation Task)

The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.

1 PAPER • NO BENCHMARKS YET

WMT 2015 News (WMT 2015 News Translation Task)

1 PAPER • NO BENCHMARKS YET

WMT 2016 IT (WMT 2016 IT Translation Task)

The IT Translation Task is a shared task introduced in the First Conference on Machine Translation. Compared to WMT 2016 News, this task brought several novelties to WMT:

1 PAPER • NO BENCHMARKS YET

Datasets

9 dataset results for Machine Translation AND Czech