🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

87 dataset results for Speech Recognition

Libri-Adapt

Libri-Adapt aims to support unsupervised domain adaptation research on speech recognition models.

5 PAPERS • 2 BENCHMARKS

word2word

word2word contains easy-to-use word translations for 3,564 language pairs.

5 PAPERS • NO BENCHMARKS YET

Artie Bias Corpus

Artie Bias Corpus is an open dataset for detecting demographic bias in speech applications.

4 PAPERS • NO BENCHMARKS YET

CSRC

CSRC (Children Speech Recognition Challenge)

CSRC is a collection of data for Children Speech Recognition. The data for this challenge is divided into 3 datasets, referred to as A (Adult speech training set), C1 (Children speech training set) and C2 (Children conversation training set). All dataset combined amount to 400 hours of Mandarin speech data.

4 PAPERS • NO BENCHMARKS YET

ClovaCall

ClovaCall is a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people. The raw dataset of ClovaCall includes approximately 112,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.

4 PAPERS • NO BENCHMARKS YET

MediaSpeech

MediaSpeech is a media speech dataset (you might have guessed this) built with the purpose of testing Automated Speech Recognition (ASR) systems performance. The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing. The dataset contains 10 hours of speech for each language provided. This release contains audio datasets in French, Arabic, Turkish and Spanish, and is a part of a larger private dataset.

4 PAPERS • 1 BENCHMARK

OLR 2021

The OLR 2021 dataset contains the data for the Oriental Language Recognition (OLR) 2021 Challenge, which intends to improve the performance of language recognition systems and speech recognition systems within multilingual scenarios.

4 PAPERS • NO BENCHMARKS YET

Open Images V7

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M includes diverse annotations types.

4 PAPERS • NO BENCHMARKS YET

RTASC

RTASC (ROBIN Technical Acquisition Speech Corpus)

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian language. We provide text files, associated speech files (WAV, 44.1KHz, 16-bit, single channel), annotated text files in CoNLL-U format.

4 PAPERS • NO BENCHMARKS YET

Timers and Such

Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. The dataset has four intents, corresponding to four common offline voice assistant uses: SetTimer, SetAlarm, SimpleMath, and UnitConversion. The semantic label for each utterance is a dictionary with the intent and a number of slots.

4 PAPERS • 1 BENCHMARK

LibriVoxDeEn

LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.

3 PAPERS • NO BENCHMARKS YET

ODSQA

ODSQA (Open-Domain Spoken Question Answering)

The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.

3 PAPERS • NO BENCHMARKS YET

TaL Corpus (The Tongue and Lips Corpus)

The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of lips. This corpus contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.

3 PAPERS • NO BENCHMARKS YET

ADIMA

ADIMA is a novel, linguistically diverse, ethically sourced, expert annotated and well-balanced multilingual profanity detection audio dataset comprising of 11,775 audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446 unique users.

2 PAPERS • NO BENCHMARKS YET

AV Digits Database

AV Digits Database is an audiovisual database which contains normal, whispered and silent speech. 53 participants were recorded from 3 different views (frontal, 45 and profile) pronouncing digits and phrases in three speech modes.

2 PAPERS • NO BENCHMARKS YET

BD-4SK-ASR

BD-4SK-ASR (Basic Dataset for Sorani Kurdish Automatic Speech Recognition)

The Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR) is a dataset for automatic speech recognition for Sorani Kurdish.

2 PAPERS • NO BENCHMARKS YET

ESB (End-to-End Speech Benchmark)

ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains, acoustic conditions, speaker styles, and transcription requirements.

2 PAPERS • NO BENCHMARKS YET

Google Speech Commands - Musan

This noisy speech test set is created from the Google Speech Commands v2 [1] and the Musan dataset[2].

2 PAPERS • 1 BENCHMARK

MASRI-HEADSET

MASRI-HEADSET is a corpus that was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender.

2 PAPERS • NO BENCHMARKS YET

NusaCrowd

NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have has brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.

2 PAPERS • NO BENCHMARKS YET

Tilde MODEL Corpus

Tilde MODEL Corpus (Tilde Multilingual Open Data for European Languages)

Tilde MODEL Corpus is a multilingual corpora for European languages – particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into a corpora standard TMX format useable for developing new Language technology products and services.

2 PAPERS • NO BENCHMARKS YET

ArzEn

ArzEn (Corpus of Egyptian Arabic-English Code-switching)

Corpus of Egyptian Arabic-English Code-switching (ArzEn) is a spontaneous conversational speech corpus, obtained through informal interviews held at the German University in Cairo. The participants discussed broad topics, including education, hobbies, work, and life experiences. The corpus currently contains 12 hours of speech, having 6,216 utterances. The recordings were transcribed and translated into monolingual Egyptian Arabic and monolingual English.

1 PAPER • NO BENCHMARKS YET

CrowdSpeech

CrowdSpeech is a publicly available large-scale dataset of crowdsourced audio transcriptions. It contains annotations for more than 20 hours of English speech from more than 1,000 crowd workers.

1 PAPER • 2 BENCHMARKS

DEEP-VOICE: DeepFake Voice Recognition (Jordan Bird)

DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion This dataset contains examples of real human speech, and DeepFake versions of those speeches by using Retrieval-based Voice Conversion.

1 PAPER • 1 BENCHMARK

EmoSpeech

EmoSpeech contains keywords with diverse emotions and background sounds, presented to explore new challenges in audio analysis.

1 PAPER • NO BENCHMARKS YET

FT Speech

FT Speech is a speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data.

1 PAPER • NO BENCHMARKS YET

Fongbe audio

Fongbe audio (Fongbe dataset)

Fongbe Data collected by Fréjus A. A LALEYE

1 PAPER • 1 BENCHMARK

GOLOS

Golos is a Russian speech dataset suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform. The total duration of the audio is about 1240 hours.

1 PAPER • NO BENCHMARKS YET

Jam-ALT

Jam-ALT (JamALT: A Formatting-Aware Lyrics Transcription Benchmark)

JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.

1 PAPER • 5 BENCHMARKS

Kite

The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:

1 PAPER • NO BENCHMARKS YET

MSNER

MSNER (Multilingual Spoken Named Entity Recognition)

This dataset contains named entities annotations for European Parliament recordings in Dutch, French, German and Spanish. The entity annotation scheme follows OntoNotes v5. The original unannotated dataset is VoxPopuli.

1 PAPER • NO BENCHMARKS YET

Persian Preschool Cognition Speech

Data collection was conducted by asking some adults from social media and some students from an elementary school to participate in our experiment. Table.1 shows the number of data gathered for recognizing each color. Due to the fact that two words are used for black in Persian, the number of black samples is more. In addition, because the color recognition is a RAN task, a sequence of data has been gathered. Table.2 depicts the number of sequence data for colors. For the meaningless words, 12 voices have been gathered on average for each word (there are 40 meaningless words in this task).

1 PAPER • NO BENCHMARKS YET

RadioTalk

RadioTalk is a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information.

1 PAPER • NO BENCHMARKS YET

VietMed

VietMed (VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain)

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To our best knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.

1 PAPER • 2 BENCHMARKS

Vāksañcayaḥ

Vāksañcayaḥ (Sanskrit Speech Corpus by IIT Bombay)

This Sanskrit speech corpus has more than 78 hours of audio data and contains recordings of 45,953 sentences with a sampling rate of 22KHz. The content is mainly readings of texts spanning over various Śāstras of Saṃskṛtam literature and also includes contemporary stories, radio program, extempore discourse, etc.

1 PAPER • NO BENCHMARKS YET

Zambezi Voice

This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk shows programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition models for our baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.

1 PAPER • NO BENCHMARKS YET

i3-video

i3-video (is-it-instructional-video)

The i3-video dataset contains "is-it-instructional" annotations for 6.4k videos from Youtube-8M. The videos are considered to be instructional if they focus on real-world human actions accompanied by procedural language that explains what’s happening on screen in reasonable details.

1 PAPER • NO BENCHMARKS YET

RESD (Russian Emotional Speech Dialogs with annotated text)

Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced pre-distributed emotions in the dialogue for ~3 minutes each. <br> Each sample of dataset contains name of part from the original dataset studio source, speech file (16000 or 44100Hz) of human voice, 1 of 7 labeled emotions and the speech-to-texted part of voice speech. <br>

0 PAPER • NO BENCHMARKS YET

aidatatang_200zh

A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is larger than 98%. Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The contents and the corresponding descriptions of the corpus include:

0 PAPER • NO BENCHMARKS YET

Datasets

87 dataset results for Speech Recognition