Dataset for lyrics alignment and transcription evaluation. It contains 20 music pieces under CC license from the Jamendo website along with their lyrics, with:
The LITIS-Rouen dataset is a dataset for audio scenes. It consists of 3026 examples of 19 scene categories. Each class is specific to a location such as a train station or an open market. The audio recordings have a duration of 30 seconds and a sampling rate of 22050 Hz. The dataset has a total duration of 1500 minutes.
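As a quick sanity check on the LITIS-Rouen figures, the total duration follows directly from the clip count and the clip length. The variable names below are illustrative only; the numbers are the ones quoted in the description.

```python
num_clips = 3026        # examples quoted in the description
clip_seconds = 30       # duration of each recording, in seconds

# Total duration in minutes, assuming every clip is exactly 30 seconds long.
total_minutes = num_clips * clip_seconds / 60
print(total_minutes)
```

The result, about 1513 minutes, is consistent with the rounded total of 1500 minutes.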
The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.
The MISP2021 challenge dataset is a collection of audio-visual conversational data recorded in a home TV scenario using distant multi-microphones. The dataset captures interactions between several individuals who are engaged in conversations in Chinese while watching TV and interacting with a smart speaker/TV in a living room. The dataset is extensive, comprising 141 hours of audio and video data, which were collected using far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. Notably, this corpus is the first of its kind to offer a distant multimicrophone conversational Chinese audio-visual dataset. Furthermore, it is also the first large vocabulary continuous Chinese lip-reading dataset specifically designed for the adverse home-TV scenario.
The Multimodal Dyadic Behavior (MMDB) dataset is a unique collection of multimodal (video, audio, and physiological) recordings of the social and communicative behavior of toddlers. The MMDB contains 160 sessions of 3-5 minute semi-structured play interaction between a trained adult examiner and a child between the ages of 15 and 30 months. The MMDB dataset supports a novel problem domain for activity recognition, which consists of the decoding of dyadic social interactions between adults and children in a developmental context.
The MSP-Podcast corpus contains speech segments from podcast recordings which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.7 of the corpus has 62,140 speaking turns (100 hours).
Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset consisting of 5,043 movie records. It is augmented by crawling video trailers associated with each movie from YouTube and text plots from Wikipedia.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
The OLGA dataset contains artist similarities from AllMusic, together with content features from AcousticBrainz. With 17,673 artists, this is the largest academic artist similarity dataset that includes content-based features to date.
PACS (Physical Audiovisual CommonSense) is the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains a total of 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. The dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem.
RWCP-SSD-Onomatopoeia is a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis.
The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset was recorded as a pre-requisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion. The data were recorded in a visual media lab with high quality audio-visual equipment, processed and labeled. To check the quality of performance, the recordings were evaluated by 10 subjects under audio, visual and audio-visual conditions. Classification systems were built using standard features and classifiers for each of the audio, visual and audio-visual modalities, and speaker-independent recognition rates of 61%, 65% and 84% achieved respectively.
SONAR is a new multilingual and multimodal fixed-size sentence embedding space with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
LA-2A Compressor data to accompany the paper "SignalTrain: Profiling Audio Compressors with Deep Neural Networks," https://arxiv.org/abs/1905.11928
The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It has the difficulty of containing many artist or music tracks with uncommon names in the commands, like “play music by [a boogie wit da hoodie]” or “I’d like to listen to [Kinokoteikoku]”.
A set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of lips. This corpus contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.
Voice conversion (VC) is a technique to transform a speaker identity included in a source speech waveform into a different one while preserving linguistic information of the source speech waveform. The Voice Conversion Challenge (VCC) 2016 was launched in 2016 at Interspeech 2016. The objective of the 2016 challenge was to better understand different VC techniques built on a freely-available common dataset to look at a common goal, and to share views about unsolved problems and challenges faced by the current VC techniques. The VCC 2016 focused on the most basic VC task, that is, the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker using a parallel clean training database where source and target speakers read out the same set of utterances in a professional recording studio. 17 research groups had participated in the 2016 challenge. The challenge was successful and it established new standard evaluation methodol
ADIMA is a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual profanity detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
AV Digits Database is an audiovisual database which contains normal, whispered and silent speech. 53 participants were recorded from 3 different views (frontal, 45° and profile) pronouncing digits and phrases in three speech modes.
We introduce AVCAffe, the first Audio-Visual dataset consisting of Cognitive load and Affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (not collected from the Internet) affective dataset in English language. We recruit 106 participants from 18 different countries of origin, spanning an age range of 18 to 57 years old, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips along with task-based self-reported ground truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, effort, and a few others. We believe AVCAffe would be a challenging benchmark for the deep learning research community given the inherent difficulty of classifying affect and cognitive load in particular. Moreover, our dataset f
The BiGe corpus comprises 54,360 shots of interest extracted from TED and TEDx talks. All shots are tracked with full 3D landmarks.
BirdClef 2018 is a bird soundscape dataset based on the contributions of the Xeno-canto network. The training set contains 36,496 recordings covering 1,500 species of Central and South America (the largest bioacoustic dataset in the literature). There are about 68 hours of recordings in total, with 1,500 classes and species tags.
BirdClef 2019 is a bird soundscape dataset. It contains around 350 hours of manually annotated soundscapes recorded with 30 field recorders between January and June of 2017 in Ithaca, NY, USA. There are around 50,000 recordings in the dataset in total, with 659 classes. The dataset also contains species tags.
The BirdVox-full-night dataset contains 6 audio recordings, each about ten hours in duration. These recordings come from ROBIN autonomous recording units, placed near Ithaca, NY, USA during fall 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10. Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency. He found 35,402 flight calls in total. He estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in this recording. Species are not labeled in BirdVox-full-night, but it is possible to tell apart thrushes from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.
COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on singing style and expression of famous solo-singers.
CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 acoustic scenes.
The DCASE 2017 rare sound events dataset contains isolated sound events for three classes: 148 crying babies (mean duration 2.25s), 139 glasses breaking (mean duration 1.16s), and 187 gun shots (mean duration 1.32s). As with the DCASE 2016 data, silences are not excluded from active event markings in the annotations. While this dataset contains many samples per class, there are only three classes.
The Distress Analysis Interview Corpus/Wizard-of-Oz set (DAIC-WOZ) dataset [24, 25] comprises voice and text samples from 189 interviewed healthy and control persons and their PHQ-8 depression detection questionnaire. This dataset is commonly used in research works for text-based detection, voice-based detection, and in multi-modal architecture
Expanded Groove MIDI dataset (E-GMD) is an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human-performed velocity annotations.
FINO-Net is a multimodal (RGB, depth and audio) dataset, containing 229 real-world manipulation data samples of 5 different manipulation types recorded with a Baxter robot.
This dataset includes all music sources, background noises, impulse response (IR) samples and conversation speech that have been used in the work "Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning", ICASSP 2021 (https://arxiv.org/abs/2010.11910).
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:
ITALIC: An ITALian Intent Classification Dataset
LibriCount is a synthetic dataset for speaker count estimation. The dataset contains a simulated cocktail party environment of 0 to 10 speakers, mixed at 0 dB SNR from random utterances of different speakers from the LibriSpeech CleanTest dataset. All recordings are 5 s long, and all speakers are active for most of the recording.
MedleyVox is an evaluation dataset for multiple singing voices separation. The problem definition in this dataset is categorised into i) duet, ii) unison, iii) main vs. rest, and iv) N-singing separation.
The MuseScore dataset is a collection of 344,166 audio and MIDI pairs downloaded from the MuseScore website. The audio is usually synthesized by the MuseScore synthesizer. The audio clips cover diverse musical genres and are about two minutes long on average.
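The mixing idea behind LibriCount can be sketched in a few lines. This is not the authors' generation pipeline: it assumes that "0 dB SNR" means every speaker contributes the same signal power, and it uses Gaussian noise as a hypothetical stand-in for LibriSpeech utterances. The names `rms`, `mix_equal_power` and `target_rms` are illustrative.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_equal_power(utterances, num_samples, target_rms=0.1):
    """Scale every utterance to the same RMS level, then sum sample by sample.

    Returns the mixture and the list of scaled sources. With no utterances
    (a speaker count of 0) the mixture is silence.
    """
    scaled = []
    for utterance in utterances:
        clip = utterance[:num_samples]
        level = rms(clip)
        gain = target_rms / level if level > 0 else 0.0
        scaled.append([gain * x for x in clip])
    if not scaled:
        return [0.0] * num_samples, scaled
    mixture = [sum(samples) for samples in zip(*scaled)]
    return mixture, scaled


# Hypothetical stand-in for speech: Gaussian noise at three loudness levels.
rng = random.Random(0)
num_samples = 8000
speakers = [
    [rng.gauss(0.0, amp) for _ in range(num_samples)]
    for amp in (0.02, 0.2, 0.5)
]

mixture, scaled = mix_equal_power(speakers, num_samples)

# Level of speaker 0 relative to speaker 1 after scaling; close to 0 dB.
relative_level_db = 20 * math.log10(rms(scaled[0]) / rms(scaled[1]))

# A speaker count of 0 yields a silent clip of the requested length.
silence, _ = mix_equal_power([], num_samples)
```

The speaker-count label for such a clip would simply be `len(speakers)`. Real data would additionally need decisions about utterance selection and overlap that the description does not spell out.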
New refined labels for the MusicNet dataset obtained by the EM process as described in the paper: Ben Maman and Amit Bermano, "Unaligned Supervision for Automatic Music Transcription in The Wild"
NIPS4Bplus is a richly annotated birdsong audio dataset comprising recordings that contain bird vocalisations, along with their active species tags and the temporal annotations acquired for them. It consists of around 687 recordings covering 87 classes, with species tags and temporal annotations. The total duration of audio is around 1 hour.
The ObjectFolder Real dataset contains multisensory data collected from 100 real-world household objects. The visual data for each object include three high-quality 3D meshes of different resolutions and an HD video recording of the object rotating in a lightbox. The acoustic data for each object include impact sound recordings recorded at 30–50 points of the object, each of which is 6s long and is accompanied by the coordinate of the striking location on the object mesh, ground-truth contact force profile, and the accompanying video for the impact. The tactile data for each object include tactile readings at the same 30–50 points of the object, with each tactile reading as a video of the tactile RGB images that record the entire gel deformation process and is accompanied by two videos of the contact process from an in-hand camera and a third-view camera.
Open Broadcast Media Audio from TV (OpenBMAT) is an open, annotated dataset for the task of music detection that contains over 27 hours of TV broadcast audio from 4 countries distributed over 1647 one-minute-long excerpts. It is designed to encompass several essential features for any music detection dataset and is the first one to include annotations about the loudness of music in relation to other simultaneous non-music sounds. OpenBMAT has been cross-annotated by 3 annotators obtaining high inter-annotator agreement percentages, which validates the annotation methodology and ensures the annotations' reliability.
The 2016 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify heart sound recordings collected from a variety of clinical or nonclinical (such as in-home visits) environments. The aim is to identify, from a single short recording (10-60s) from a single precordial location, whether the subject of the recording should be referred on for an expert diagnosis.
Dataset of Room Impulse Responses measured at the Acoustic Technology group facilities, DTU Electro. The measurements were carried out in building 355, room 008, otherwise known as the "sound field control" room.
This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as a part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but also possess a wider applicability to sound event detection, classification, and localization. The taxonomy we used for the labels in this dataset has been designed to be compatible with other existing datasets for urban sound tagging while also able to capture sound events unique to the Singaporean context. Please refer to our conference paper published in APSIPA 2021 (which is found in this repository as the file "APSIPA.pdf") or download the readme ("Readme.md") for more details regarding the data collection, annotation, and processing methodologies for the creation of the dataset.
Stanford-ECM is an egocentric multimodal dataset which comprises about 27 hours of egocentric video augmented with heart rate and acceleration data. The individual videos range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30 Hz. The mobile phone was equipped with a wide-angle lens, so that the horizontal field of view was enlarged from 45 degrees to about 64 degrees. A wrist-worn heart rate sensor was used to capture the heart rate every 5 seconds. The phone and heart rate monitor were time-synchronized through Bluetooth, and all data was stored in the phone’s storage. Piecewise cubic polynomial interpolation was used to fill in any gaps in heart rate data. Finally, data was aligned to the millisecond level at 30 Hz.
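The gap-filling and alignment step for Stanford-ECM can be illustrated with a small sketch. The description does not say which piecewise cubic scheme was used, so this example assumes a cubic Hermite interpolant with finite-difference tangents. The function name `cubic_hermite_fill` and the heart-rate readings are hypothetical.

```python
from bisect import bisect_right


def cubic_hermite_fill(times, values, query_times):
    """Evaluate a piecewise cubic Hermite interpolant at query_times.

    times must be strictly increasing. Tangents are estimated from the
    secant slopes of neighbouring intervals (one-sided at both ends).
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    slopes = [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(n - 1)
    ]
    tangents = (
        [slopes[0]]
        + [0.5 * (slopes[i - 1] + slopes[i]) for i in range(1, n - 1)]
        + [slopes[-1]]
    )
    result = []
    for t in query_times:
        # Index of the interval [times[k], times[k + 1]] used for t.
        k = min(max(bisect_right(times, t) - 1, 0), n - 2)
        h = times[k + 1] - times[k]
        s = (t - times[k]) / h
        h00 = 2 * s**3 - 3 * s**2 + 1
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        result.append(
            h00 * values[k]
            + h10 * h * tangents[k]
            + h01 * values[k + 1]
            + h11 * h * tangents[k + 1]
        )
    return result


# Hypothetical readings taken every 5 s, with those at 10 s and 15 s missing.
hr_times = [0.0, 5.0, 20.0, 25.0]
hr_values = [72.0, 74.0, 80.0, 79.0]

# Video frame timestamps at 30 Hz covering 0 s to 25 s inclusive.
frame_times = [i / 30.0 for i in range(751)]
hr_at_frames = cubic_hermite_fill(hr_times, hr_values, frame_times)
```

Each video frame then has a heart-rate value, and the interpolant passes exactly through the recorded readings. A library routine such as a PCHIP or cubic-spline interpolator would serve the same purpose.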
The TAU-NIGENS Spatial Sound Events 2020 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats, a microphone array one (MIC), and first-order Ambisonics one (FOA). The sound events are spatialized as either stationary sound sources in the room, or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) to the recording point, and a temporal onset and offset time. The isolated sound event recordings used for the sy
URBAN-SED is a dataset of 10,000 soundscapes with sound event annotations, generated using the scaper library. The dataset totals almost 30 hours and includes close to 50,000 annotated sound events. Every soundscape is 10 seconds long and has a background of Brownian noise resembling the typical “hum” often heard in urban environments. Every soundscape contains between 1 and 9 sound events from the following classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren and street_music. The source material for the sound events is the set of clips from the UrbanSound8K dataset. URBAN-SED comes pre-sorted into three sets: train, validate and test. There are 6000 soundscapes in the training set, generated using clips from folds 1-6 in UrbanSound8K, 2000 soundscapes in the validation set, generated using clips from folds 7-8 in UrbanSound8K, and 2000 soundscapes in the test set, generated using clips from folds 9-10 in