VAST Absorption is a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls’ absorption coefficients.
We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
Virtuoso Strings is a dataset for soft onset detection for string instruments. It consists of over 144 recordings of professional performances of an excerpt from Haydn's string quartet Op. 74 No. 1 Finale, each with corresponding individual instrumental onset annotations.
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
WHAMR_ext is an extension of the WHAMR corpus with larger RT60 values (between 1 s and 3 s).
This is a private dataset collected for automatic analysis of psychological distress. It contains self-reported distress labels provided by human volunteers and consists of 30-minute interview recordings of participants.
The Western Mediterranean Wetlands Bird Dataset is a manually labelled collection of vocalizations from bird species inhabiting the wetlands of the "Aiguamolls de l'Empordà" natural park in Girona, Spain. It consists of 5,795 labelled audio clips of varying length derived from 1,098 source recordings retrieved from the Xeno-Canto portal, totalling 201.6 minutes (12,096 seconds) of vocalizations alongside their corresponding annotations. It also comes with a Mel spectrogram version of the data, where each image represents a 1-second window of the original audio, resulting in a total of 17,536 spectrographic images. These are stored in matrix form within .npy files. These are the species covered:
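Since the spectrograms ship as matrices inside .npy files, they can be read directly with NumPy. The sketch below is illustrative only: the file name `mel_demo.npy` and the array shape (windows × mel bands × frames) are assumptions, not the dataset's actual layout.

```python
import numpy as np

# Hypothetical example: write and read back a small stack of 1-second
# Mel-spectrogram windows, stored in matrix form within a .npy file.
demo = np.random.rand(4, 64, 44).astype(np.float32)  # (windows, mel bands, frames) -- assumed shape
np.save("mel_demo.npy", demo)

spectrograms = np.load("mel_demo.npy")
print(spectrograms.shape)  # (4, 64, 44)
```

The real dataset would hold 17,536 such one-second windows rather than the 4 used here for illustration.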
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos are from various types of content formats, such as podcasts, lectures, news, corporate events \& promotional content, and, more broadly, videos from individual content creators. We refer to the paper for further information.
This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk show programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition models for our baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.
The HumBug Zooniverse dataset is a dataset of mosquito audio recordings. With over a thousand contributors, it contains 195,434 labels of two-second duration, of which approximately 10 percent signify mosquito events.
gtzan_music_speech is a dataset for music/speech discrimination. It consists of 120 tracks of 30 second length. Each class (music/speech) has 60 samples. The tracks are all 22050Hz Mono 16-bit audio files in .wav format.
This dataset is composed of paired videos of people dancing 3 different music styles: Ballet, Michael Jackson and Salsa. It contains multimodal data (visual data, temporal graphs and audio) carefully selected from publicly available videos of dancers performing movements representative of each music style, together with audio data from the respective styles.
We present a multilingual test set for conducting speech intelligibility tests in the form of diagnostic rhyme tests. The materials currently contain audio recordings in 5 languages and further extensions are in progress. For Mandarin Chinese, we provide recordings for a consonant contrast test as well as a tonal contrast test. Further information on the audio data, test procedure and software to set up a full survey which can be deployed on crowdsourcing platforms is provided in our paper [arXiv preprint] and GitHub repository. We welcome contributions to this open-source project.
nEMO is a simulated dataset of emotional speech in the Polish language. The corpus contains over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language. The corpus is available for free under the Creative Commons license (CC BY-NC-SA 4.0).
This open-source dataset consists of 5.04 hours of transcribed English conversational speech beyond telephony, comprising 13 conversations.
The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech at the phoneme level. The annotations include word stress marks on the individual phonemes. The speech corpus has been developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Synthesized speech produced using this corpus has a high-quality, natural voice.
This is the subset of audio samples from the AudioSet ontology that are licensed under Creative Commons. The set contains approximately 10,000 ten-second clips and is freely modifiable and distributable. Each clip comes with its full label set and a unique ID.
The Blind Audio-Visual Localization (BAVL) dataset consists of 20 audio-visual recordings of sound sources, which can be talking faces or musical instruments. Most of the recordings (19) are videos from YouTube, except V8, which is from [1]. In addition, video V7 was also used in [2], and V16 in [3]. All 20 videos are annotated by the authors in a uniform manner. Details of the video sequences are listed in Table 1.
Bach chorales is a univariate time series based on chorales, where the task is to learn generative grammar. The dataset consists of single-line melodies of 100 Bach chorales (originally 4 voices). The melody line can be studied independently of other voices. The grand challenge is to learn a generative grammar for stylistically valid chorales.
The BirdVox-DCASE-20k dataset contains 20,000 ten-second audio recordings. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during the fall of 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10. Out of these 20,000 recordings, 10,017 (50.09%) contain at least one bird vocalization (either song, call, or chatter). The dataset is a derivative work of the BirdVox-full-night dataset, containing almost as much data but formatted into ten-second excerpts rather than ten-hour full-night recordings.
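The reformatting step described above, from full-night recordings to fixed ten-second excerpts, amounts to chopping a long waveform into non-overlapping windows. A minimal sketch of that idea, assuming a hypothetical sample rate of 24 kHz (not stated in the description):

```python
import numpy as np

SR = 24000          # assumed sample rate, for illustration only
CLIP_SECONDS = 10   # the excerpt length used by BirdVox-DCASE-20k

def to_excerpts(full_night: np.ndarray, sr: int = SR) -> np.ndarray:
    """Split a long 1-D recording into non-overlapping ten-second excerpts,
    dropping any incomplete trailing segment."""
    clip_len = sr * CLIP_SECONDS
    n_clips = len(full_night) // clip_len
    return full_night[: n_clips * clip_len].reshape(n_clips, clip_len)

night = np.zeros(SR * 35)   # a dummy 35-second "full-night" recording
clips = to_excerpts(night)
print(clips.shape)          # (3, 240000): three complete ten-second excerpts
```

This is only a sketch of the windowing idea, not the pipeline the BirdVox authors used.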
CLO-43SD is a dataset for multi-class species identification in avian flight calls. It consists of 5,428 labeled audio clips of flight calls from 43 different species of North American woodwarblers (in the family Parulidae). The clips came from a variety of recording conditions, including clean recordings obtained using highly-directional shotgun microphones, recordings obtained from noisier field recordings using omnidirectional microphones, and recordings obtained from birds in captivity.
CLO-SWTH is a dataset for species-specific flight call identification for the Swainson's Thrush. It consists of 179,111 labeled audio clips captured by remote acoustic sensors deployed in Ithaca, NY and NYC over the fall 2014 and spring 2015 migration seasons. Each clip is labeled to indicate whether it contains a flight call from the target species Swainson's Thrush (SWTH), a flight call from a non-target species, or no flight call at all.
CLO-WTSP is a dataset for species-specific flight call identification for the White-Throated Sparrow. It consists of 16,703 labeled audio clips captured by remote acoustic sensors deployed in Ithaca, NY and NYC over the fall 2014 and spring 2015 migration seasons. Each clip is labeled to indicate whether it contains a flight call from the target species White-Throated Sparrow (WTSP), a flight call from a non-target species, or no flight call at all.
Chernobyl is a collection of 620 audio clips collected from unattended remote monitoring equipment in the Chernobyl Exclusion Zone (CEZ). This data was collected as part of the TREE (Transfer-Exposure-Effects) research project into the long-term effects of the Chernobyl accident on local ecology. The audio covers a range of birds and includes weather, large mammal and insect noise sampled across various CEZ environments, including abandoned village, grassland and forest areas.
The Couples Therapy corpus contains audio, video recordings and manual transcriptions of conversations between 134 real-life couples attending marital therapy. In each session, one person selected a topic that was discussed with the spouse for 10 minutes. At the end of the session, both speakers were rated separately on 33 "behavior codes" by multiple annotators based on the Couples Interaction and Social Support Rating Systems. Each behavior was rated on a Likert scale from 1, indicating absence, to 9, indicating strong presence. A session-level rating was obtained for each speaker by averaging the annotator ratings. The process was then repeated with the other spouse selecting the topic, yielding 2 sessions per couple at a time. The total number of sessions per couple varied between 2 and 6.
DBR dataset is an environmental audio dataset created for the Bachelor's Seminar in Signal Processing in Tampere University of Technology. The samples in the dataset were collected from the online audio database Freesound. The dataset consists of three classes, each containing 50 samples, and the classes are 'dog', 'bird', and 'rain' (hence the name DBR).
Deeply Vocal Characterizer is a human nonverbal vocalization dataset. This sample dataset consists of about 0.6 hours (56.7 hours in the full set) of audio (16 kHz, 16-bit, mono) across 16 human nonverbal vocalization classes, including throat-clearing, coughing, laughing, and panting. The audio content was crowdsourced from the general public of South Korea.
DementiaBank is a shared database of multimedia interactions for the study of communication in dementia. The dataset contains recordings of 117 people diagnosed with Alzheimer's disease and 93 healthy people reading a description of an image. The principal task and benchmark is to classify subjects into these two groups.
The FSL4 dataset contains ~4000 user-contributed loops uploaded to Freesound. Loops were selected by searching Freesound for sounds with the query terms loop and bpm, and then automatically parsing the returned sound filenames, tags and textual descriptions to identify tempo annotations made by users. For example, a sound containing the tag 120bpm is considered to have a ground truth of 120 BPM.
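The tag-parsing step can be sketched with a simple regular expression: a tag such as `120bpm` is mapped to a ground-truth tempo of 120 BPM. This is an illustrative reconstruction, not the exact parsing rules used to build FSL4.

```python
import re

# Match tags like "120bpm" (2-3 digits followed by "bpm", case-insensitive).
# The precise patterns accepted by the FSL4 authors may differ.
BPM_TAG = re.compile(r"^(\d{2,3})bpm$", re.IGNORECASE)

def bpm_from_tags(tags):
    """Return the first tempo annotation found among a sound's tags, or None."""
    for tag in tags:
        m = BPM_TAG.match(tag.strip())
        if m:
            return int(m.group(1))
    return None

print(bpm_from_tags(["loop", "drums", "120bpm"]))  # 120
print(bpm_from_tags(["loop", "ambient"]))          # None
```

The same idea extends to filenames and textual descriptions by running the pattern over their whitespace-separated tokens.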
This dataset was created for a Fongbe automatic speech recognition task and contains about 3,979 recordings of 13 participants reading a text written in Fongbe, one sentence at a time. Fongbe is a vernacular language spoken mainly in Benin by more than 50% of the population, and to a lesser extent in Togo and Nigeria. It is under-resourced because it lacks linguistic resources (speech corpora and text data), and very few websites provide textual data. In this dataset, each example contains the audio file and the associated text. The audio is high-quality (16-bit, 16 kHz), recorded using an Android app that we built for this purpose. The dataset is multi-speaker, containing recordings from 13 volunteers (male and female).
The volumetric representation of human interactions is one of the fundamental domains in the development of immersive media productions and telecommunication applications. Particularly in the context of the rapid advancement of Extended Reality (XR) applications, this volumetric data has proven to be an essential technology for future XR development. In this work, we present a new multimodal database to help advance the development of immersive technologies. Our proposed database provides ethically compliant and diverse volumetric data, in particular 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs). The recording system consists of a volumetric capture (VoCap) studio, including 31 synchronized modules with 62 RGB cameras and 31 depth cameras. In addition to textured meshes, point clouds, and multi-view RGB-D data, we use one Lytro Illum camera for providing light field (LF) data simultaneously.
ISMIR2004 is an audio dataset consisting of 6 genres with 729 excerpts of 30 seconds. It is a dataset used for musical genre classification. The training set consists of 320 classical music samples, 115 electronic music samples, 26 jazz blues samples, 45 metal/punk samples, 101 rock/pop samples and 122 world samples.
MAEC is a new, large-scale multi-modal, text-audio paired earnings-call dataset based on S&P 1500 companies.
The MCCS dataset is the first large-scale Mandarin Chinese Cued Speech dataset. It covers 23 major categories of scenarios (e.g., communication, transportation and shopping) and 72 subcategories of scenarios (e.g., meeting, dating and introduction). It was recorded by four skilled native Mandarin Chinese Cued Speech cuers with portable cameras on mobile phones. The Cued Speech videos are recorded at 30 fps and 1280×720 resolution. We provide the raw Cued Speech videos, a text file (with 1000 sentences) and corresponding annotations, which contain two kinds of annotation: continuous video annotations made with ELAN, and discrete audio annotations made with Praat.
The dataset includes recordings from 10 different users teaching the robot different common kitchen objects. It consists of synchronized recordings from three cameras and a microphone mounted on the robot.
The MOBIO database consists of bi-modal (audio and video) data taken from 152 people. The database has a female-male ratio of nearly 1:2 (100 males and 52 females) and was collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal database with both native and non-native English speakers.
MedleyDB 2.0 is a superset of MedleyDB, a dataset of annotated, royalty-free multitrack recordings. The second iteration of the dataset includes 74 new multitrack recordings, resulting in 194 songs in total.
The MIVIA audio events data set is composed of a total of 6000 events for surveillance applications, namely glass breaking, gun shots and screams. The 6000 events are divided into a training set (composed of 4200 events) and a test set (composed of 1800 events).
Mixing Secrets is an instrument recognition dataset containing 258 multi-track recordings sourced from the Mixing Secrets for The Small Studio website. The dataset was labelled to be consistent with MedleyDB format.
Mudestreda is a multimodal device state recognition dataset obtained from a real industrial milling device, with time series and image data for classification, regression, anomaly detection, remaining useful life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.
Music4All-Onion is a large-scale, multi-modal music dataset that expands the Music4All dataset by including 26 additional audio, video, and metadata features for 109,269 music pieces and provides a set of 252,984,396 listening records of 119,140 users, extracted from the online music platform Last.fm.
The Persian Consonant Vowel Combination (PCVC) dataset is a phoneme-based speech dataset, and also the first free Persian speech dataset intended to help Persian speech researchers. It contains 23 Persian consonants and 6 vowels. The sound samples cover all possible combinations of consonants and vowels (138 samples per speaker), each with a length of 30,000 data samples. The sample rate of all speech samples is 48,000 Hz, meaning there are 48,000 data samples in every second. In each sample, the sound starts with the consonant, followed by the vowel, and ends with silence. The length of the silence depends on the length of the consonant-vowel combination: for example, if the combination ends at the 20,000th data sample, the remaining 10,000 samples (up to the 30,000-sample length of each sound sample) are silence.
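The fixed-length layout described above (a consonant-vowel segment followed by zero-padded silence up to 30,000 samples) can be sketched as follows. This is a minimal illustration of the format, not the authors' preparation code.

```python
import numpy as np

SAMPLE_RATE = 48000   # 48,000 data samples per second, as described
FIXED_LEN = 30000     # every PCVC sample is 30,000 data samples long

def pad_to_fixed_length(cv_sound: np.ndarray) -> np.ndarray:
    """Pad a consonant-vowel recording with trailing zeros (silence)
    so that the result is exactly FIXED_LEN data samples long."""
    out = np.zeros(FIXED_LEN, dtype=cv_sound.dtype)
    n = min(len(cv_sound), FIXED_LEN)
    out[:n] = cv_sound[:n]
    return out

cv = np.ones(20000)             # a combination ending at the 20,000th sample
padded = pad_to_fixed_length(cv)
print(len(padded), int((padded == 0).sum()))  # 30000 10000
```

With a 20,000-sample combination, the remaining 10,000 samples come out as zeros, matching the silence-padding rule in the description.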
PolandNFC is a collection of 4,000 recordings from Hanna Pamuła's PhD project of monitoring autumn nocturnal bird migration. The recordings were collected every night, from September to November 2016 on the Baltic Sea coast, Poland, using Song Meter SM2 units with microphones mounted on 3–5 m poles. A subset derived from 15 nights with different weather conditions and background noise including wind, rain, sea noise, insect calls, human voice and deer calls was used in DCASE 2018 Challenge.
This is a Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who voiced pre-assigned emotions in dialogues of ~3 minutes each. Each sample contains the name of the part from the original studio source, a speech file (16,000 or 44,100 Hz) of a human voice, one of 7 labeled emotions, and the speech-to-text transcription of the recording.
Robbie Williams is a dataset of 65 songs by Robbie Williams. It consists of chords, keys and beats. The dataset does not include audio.
STVD-FC is the largest public dataset for political content analysis and fact-checking tasks. It consists of more than 1,200 fact-checked claims that have been scraped from a fact-checking service with associated metadata. For the video counterpart, the dataset contains nearly 6,730 TV programs with metadata, having a total duration of 6,540 hours. These programs were collected during the 2022 French presidential election with a dedicated workstation and protocol. The dataset is delivered in several parts, with proper indexes, to make its 2 TB of data accessible. More information about the STVD-FC dataset can be found in the publication [1].
SimSceneTVB is a dataset of 600 simulated sound scenes of 45 s each representing urban sound environments, simulated using the simScene Matlab library. The dataset is divided into two parts, a train subset (400 scenes) and a test subset (200 scenes), for the development of learning-based models. Each scene is composed of three main sources (traffic, human voices and birds) according to an original scenario, which is composed semi-randomly, conditioned on five ambiances: park, quiet street, noisy street, very noisy street and square. Separate channels for the contribution of each source are available. The base audio files used for simulation are obtained from Freesound (https://freesound.org) and LibriSpeech (http://www.openslr.org/12). The sound scenes are scaled according to a playback sound level in dB, which is drawn randomly but remains plausible for the ambiance.