VoxCeleb2 is a large-scale speaker recognition dataset obtained automatically from open-source media. It consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real-world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. Because the dataset is audio-visual, it is also useful for a number of other applications, for example visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to complement existing face recognition datasets.
481 PAPERS • 5 BENCHMARKS
The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. It can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), and speaker identification (I).
1 PAPER • 3 BENCHMARKS