2 dataset results for Data Augmentation AND Texts AND French

MuST-C

MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.

194 PAPERS • 2 BENCHMARKS

CONAN (COunter NArratives through Nichesourcing)

COunter NArratives through Nichesourcing (CONAN) is a dataset of 4,078 hate speech/counter-narrative pairs across three languages (English, French and Italian). Three types of metadata are also provided: expert demographics, hate speech sub-topic and counter-narrative type. The dataset was augmented through translation (from Italian/French to English) and paraphrasing, which brought the total number of pairs to 14,988.

21 PAPERS • NO BENCHMARKS YET
