CVSS is a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems.
18 PAPERS • 1 BENCHMARK
The dataset contains several speakers. The five largest are listed individually; the rest are grouped as "other". All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a clean variant of even higher quality in addition to the full dataset. Various statistics are also provided. The dataset can also be used for automatic speech recognition (ASR) if the audio files are converted to 16 kHz.
3 PAPERS • 2 BENCHMARKS
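The 44.1 kHz to 16 kHz conversion mentioned above for ASR use can be sketched as follows. This is a minimal, dependency-free illustration, not part of the dataset's own tooling: it assumes mono audio already loaded as a list of floats, and the function name `resample` and the `half_width` parameter are hypothetical. It applies a Hann-windowed sinc low-pass filter so that content above the new 8 kHz Nyquist limit is attenuated before the rate is reduced. For full-length recordings, a dedicated resampling library would be much faster.

```python
import math


def resample(samples, src_rate=44100, dst_rate=16000, half_width=32):
    """Resample a mono signal using a Hann-windowed sinc low-pass filter.

    samples    -- list of floats (mono audio), an assumed input format
    half_width -- filter reach on each side, in input samples (illustrative)
    """
    # Cutoff relative to the source rate. It is below 1 when downsampling,
    # which places the filter edge at the new Nyquist frequency.
    cutoff = min(1.0, dst_rate / src_rate)
    step = src_rate / dst_rate  # input samples per output sample
    n_out = len(samples) * dst_rate // src_rate

    out = []
    for n in range(n_out):
        t = n * step  # output position measured in input-sample units
        lo = max(0, math.ceil(t - half_width))
        hi = min(len(samples) - 1, math.floor(t + half_width))
        acc = 0.0
        for k in range(lo, hi + 1):
            d = t - k
            x = math.pi * cutoff * d
            sinc = 1.0 if x == 0.0 else math.sin(x) / x
            # Hann window: 1 at d = 0, tapering to 0 at |d| = half_width.
            window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
            acc += samples[k] * cutoff * sinc * window
        out.append(acc)
    return out


# Usage: 0.1 s of a 440 Hz tone at 44.1 kHz becomes 1600 samples at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * i / 44100) for i in range(4410)]
tone_16k = resample(tone)
```

Samples near the start and end of the output are computed from a truncated filter window, so they are less accurate than those in the middle of the signal.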
Thorsten-Voice (Thorsten-21.02-neutral) is a neutrally spoken voice dataset recorded by Thorsten Müller, with audio optimized by Dominik Kreutz, and licensed under CC0 so that anyone can use it without financial or licensing obstacles. It is intended for German speech synthesis as a single-speaker dataset. It contains about 23 hours of high-quality audio.
1 PAPER • 1 BENCHMARK