WMT 2016 is a collection of datasets used in shared tasks of the First Conference on Machine Translation. The conference builds on ten previous Workshops on statistical Machine Translation.
168 PAPERS • 18 BENCHMARKS
A corpus of parallel text in 21 European languages from the proceedings of the European Parliament.
125 PAPERS • NO BENCHMARKS YET
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012.
55 PAPERS • NO BENCHMARKS YET
The Machine Translation of Noisy Text (MTNT) dataset is a Machine Translation dataset that consists of noisy comments on Reddit and professionally sourced translation. The translation are between French, Japanese and French, with between 7k and 37k sentence per language pair.
51 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.
33 PAPERS • 1 BENCHMARK
Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
6 PAPERS • NO BENCHMARKS YET
A new English-French test set for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori.
4 PAPERS • NO BENCHMARKS YET
SRL is the task of extracting semantic predicate-argument structures from sentences. X-SRL is a multilingual parallel Semantic Role Labelling (SRL) corpus for English (EN), German (DE), French (FR) and Spanish (ES) that is based on English gold annotations and shares the same labelling scheme across languages.
Tilde MODEL Corpus is a multilingual corpora for European languages – particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into a corpora standard TMX format useable for developing new Language technology products and services.
2 PAPERS • NO BENCHMARKS YET
WikiCaps is a large-scale multilingual but non-parallel data set for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus a representative of the collection of largely available non-descriptive image-caption pairs in the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.
French sentences are sourced from Tatoeba repository and then translated into Congolese Swahili.
1 PAPER • NO BENCHMARKS YET
The GATITOS (Google's Additional Translations Into Tail-languages: Often Short) dataset is a high-quality, multi-way parallel dataset of tokens and short phrases, intended for training and improving machine translation models. This dataset consists in 4,000 English segments (4,500 tokens) that have been translated into each of 26 low-resource languages, as well as three higher-resource pivot languages (es, fr, hi). All translations were made directly from English, with the exception of Aymara, which was translated from the Spanish.
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, French, Russian) and additional 1500 sentences from each of the 5 languages translated to English. The sentences are taken from newspaper articles for each language pair, except for French, where the test set was drawn from user-generated comments on the news articles (from Guardian and Le Monde). The translation was done by professional translators.
The Biomedical Translation Shared Task was first introduced at the First Conference of Machine Translation. The task aims to evaluate systems for the translation of biomedical titles and abstracts from scientific publications. The data includes three language pairs (English ↔ Portuguese, English ↔ Spanish, English ↔ French) and two sub-domains of biological sciences and health sciences.