Spoken versions of the Semantic Textual Similarity (STS) dataset for testing semantic sentence-level embeddings. The dataset contains thousands of sentence pairs annotated by humans for semantic similarity. The spoken sentences can be used with sentence embedding models to test whether a model learns to capture sentence semantics. All sentences are available in 6 synthetic WaveNet voices, and a subset (5%) in 4 real voices recorded in a sound-attenuated booth. Code to train a visually grounded spoken sentence embedding model, along with evaluation code, is available at https://github.com/DannyMerkx/speech2image/tree/Interspeech21
3 PAPERS • NO BENCHMARKS YET
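The standard STS evaluation protocol, which this dataset supports in spoken form, scores each sentence pair by the cosine similarity of its two embeddings and correlates those scores with the human ratings. A minimal sketch in Python, where `embed` is a placeholder for any (spoken) sentence encoder rather than a function from the linked repository:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_sts(pairs, human_scores, embed):
    """pairs: list of (sentence_a, sentence_b); embed: any encoder mapping
    a sentence (text or audio) to a fixed-size vector."""
    predicted = [cosine(embed(a), embed(b)) for a, b in pairs]
    rho, _ = spearmanr(predicted, human_scores)
    return rho
```

Spearman's rank correlation is conventional here because only the ordering of the similarity scores matters, not their scale.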
The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of the lips. It contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.
AV Digits Database is an audiovisual database containing normal, whispered, and silent speech. 53 participants were recorded from 3 different views (frontal, 45°, and profile) pronouncing digits and phrases in three speech modes.
2 PAPERS • NO BENCHMARKS YET
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. It comprises eight English speech recognition datasets covering a wide range of domains, acoustic conditions, speaker styles, and transcription requirements.
This dataset includes all music sources, background noises, impulse response (IR) samples, and conversational speech used in the ICASSP 2021 paper "Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning" (https://arxiv.org/abs/2010.11910).
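In fingerprinting work of this kind, impulse responses are typically used to simulate room and device acoustics during training. A hedged illustration of that augmentation step (an assumption about the pipeline, not the paper's exact recipe; `apply_ir` is a name introduced here):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_ir(audio, ir):
    """Convolve dry audio with a recorded impulse response, trim back to the
    original length, and renormalise peak level; both arrays are assumed to
    share one sample rate."""
    wet = fftconvolve(audio, ir, mode="full")[: len(audio)]
    return wet / (np.max(np.abs(wet)) + 1e-12)
```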
This noisy speech test set is created from Google Speech Commands v2 [1] and the MUSAN dataset [2].
2 PAPERS • 1 BENCHMARK
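Noisy test sets of this kind are usually built by mixing clean utterances with noise clips at controlled signal-to-noise ratios. A minimal sketch of such mixing, assuming both signals are float arrays at the same sample rate (the exact construction used for this set is not specified here):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a clean utterance at a target SNR in dB."""
    noise = np.resize(noise, speech.shape)   # loop or trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12    # guard against silent noise
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```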
RyanSpeech is a speech corpus for research on automated text-to-speech (TTS) systems. Its textual materials come from real-world conversational settings and cover over 10 hours of a professional male voice actor's speech, recorded at 44.1 kHz.
CSI is a criminal conversational dataset for speaker identification built from the CSI television show. The authors collected transcripts of 39 episodes and video/audio of 4 episodes. Each episode involves on average more than 30 speakers. Utterances last on average 3 to 4 seconds. There are around 45 to 50 distinct scenes/conversations per episode.
1 PAPER • NO BENCHMARKS YET
CrowdSpeech is a publicly available large-scale dataset of crowdsourced audio transcriptions. It contains annotations for more than 20 hours of English speech from more than 1,000 crowd workers.
1 PAPER • 2 BENCHMARKS
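A common use of such data is to study how several noisy worker transcriptions can be aggregated into a single output. A simple medoid baseline (an illustration only, not the dataset's official method) picks the transcription closest to all the others by edit distance:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling 1-D dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def aggregate(candidates):
    """Return the candidate transcription closest to all the others."""
    return min(candidates,
               key=lambda c: sum(edit_distance(c, o) for o in candidates))
```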
This dataset is a new variant of the Voice Cloning Toolkit (VCTK) dataset: device-recorded VCTK (DR-VCTK). The high-quality speech signals, originally recorded in a semi-anechoic chamber using professional audio devices, are played back and re-recorded in office environments using relatively inexpensive consumer devices.
The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. It can be used to develop and benchmark conversational systems for user authentication tasks, i.e., speaker enrolment (E), speaker verification (V), and speaker identification (I).
1 PAPER • 3 BENCHMARKS
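The V and I tasks reduce to comparing utterance-level speaker embeddings against those stored at enrolment. A minimal sketch, assuming embeddings from any speaker encoder (the function names and the 0.7 threshold are illustrative, not part of EVI):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled_emb, test_emb, threshold=0.7):
    """V: accept the claimed identity if similarity clears a threshold;
    the threshold must be tuned on held-out data."""
    return cosine(enrolled_emb, test_emb) >= threshold

def identify(test_emb, enrolled):
    """I: pick the enrolled speaker whose embedding is most similar;
    `enrolled` maps speaker name -> embedding stored during the E step."""
    return max(enrolled, key=lambda name: cosine(enrolled[name], test_emb))
```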
The Edinburgh International Accents of English Corpus (EdAcc) is an automatic speech recognition (ASR) dataset composed of 40 hours of dyadic English conversations between speakers with a diverse set of accents. EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile for each speaker.
EmoSpeech contains keywords spoken with diverse emotions against varied background sounds, and is intended to expose new challenges in audio analysis.
We release the dataset for non-commercial research. Submit requests at https://forms.gle/6WPEGNWbYoEe6bte8.
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
1 PAPER • 5 BENCHMARKS
The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). Three modalities are present in the dataset.
LibriS2S is a speech-to-speech translation (S2ST) dataset built on top of existing resources. It provides English-German speech and text quadruplets amounting to just over 50 hours for both languages.
Here we release the dataset (Multi_Channel_Grid, abbreviated as MC_Grid) used in our paper "LiMuSE: Lightweight Multi-Modal Speaker Extraction".
We announce the release of a new multilingual speaker dataset, the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The dataset contains speech in six languages: five Indian languages plus Indian English, contributed by 345 bilingual speakers in India. Each speaker has contributed about 4-5 minutes of data, including recordings in both English and their mother tongue. Transcripts are provided in UTF-8 format. For every speaker, the dataset contains metadata such as L1, native place, medium of instruction, and current place of residence, as well as physical parameters such as age, height, shoulder size, and weight. We hope the dataset is useful for a diverse set of research activities, including multilingual speaker recognition, language and accent recognition, and automatic speech recognition.
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of overall quality and of the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, the corpus contains more than 97,000 human ratings for each dimension and for the overall MOS.
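A corpus like this is typically used to score speech-quality predictors by correlating their outputs with the subjective ratings, per dimension. A hedged sketch (the dimension names mirror the labels above, but the dict layout is an assumption, not NISQA's file schema):

```python
import numpy as np

DIMENSIONS = ["mos", "noisiness", "coloration", "discontinuity", "loudness"]

def per_dimension_pearson(predicted, subjective):
    """predicted, subjective: dicts mapping dimension -> array of ratings;
    returns the Pearson correlation for each quality dimension."""
    return {d: float(np.corrcoef(predicted[d], subjective[d])[0, 1])
            for d in DIMENSIONS}
```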
Situated Dialogue Navigation (SDN) is a navigation benchmark of 183 trials with a total of 8,415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is designed to evaluate an agent's ability to predict dialogue moves from humans as well as to generate its own dialogue moves and physical navigation actions.
Spot the Difference Corpus is a corpus of task-oriented spontaneous dialogues, containing 54 interactions between pairs of subjects working together to find the differences between two very similar scenes. The corpus includes rich transcriptions, annotations, audio, and video.
The Spoken Wikipedia Corpus (SWC) is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. The corpus has several outstanding characteristics.
1 PAPER • 1 BENCHMARK
The Varied Emotion in Syntactically Uniform Speech (VESUS) repository is a lexically controlled database collected by the NSA lab, in which actors read a semantically neutral script of words, phrases, and sentences with different emotional inflections. VESUS contains 252 distinct phrases, each read by 10 actors in 5 emotional states (neutral, angry, happy, sad, and fearful).
WHAMR_ext is an extension of the WHAMR corpus with larger RT60 values (between 1 s and 3 s).
This is a private dataset collected for the automatic analysis of psychological distress. It contains self-reported distress labels provided by human volunteers and consists of 30-minute interview recordings of participants.
A dataset of Twitter usernames of American politicians, with data extracted from Wikidata.
FluencyBank is a shared database for the study of fluency development. Participants include typically-developing monolingual and bilingual children, children and adults who stutter (C/AWS) or who clutter (C/AWC), and second language learners.
0 PAPERS • NO BENCHMARKS YET