1 code implementation • 26 Mar 2024 • Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen
Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling.
1 code implementation • 13 Mar 2024 • John Martinsson, Olof Mogren, Maria Sandsten, Tuomas Virtanen
In this work we propose an audio recording segmentation method based on an adaptive change point detection (A-CPD) for machine guided weak label annotation of audio recording segments.
no code implementations • 11 Jan 2024 • Mikko Heikkinen, Archontis Politis, Tuomas Virtanen
Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly spaced spherical microphone arrays.
no code implementations • 17 Dec 2023 • Yuzhu Wang, Archontis Politis, Tuomas Virtanen
The clean speech clips from WSJ0 are employed for simulating speech signals of moving speakers in a reverberant environment.
no code implementations • 9 Aug 2023 • Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen
Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment.
1 code implementation • 16 Jun 2023 • Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen
Conversely, the results suggest that using only binary relevances defined by captioning-based audio-caption pairs is sufficient for contrastive learning.
1 code implementation • NeurIPS 2023 • Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji
While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker.
no code implementations • 14 Jun 2023 • David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen
Conventional recurrent neural networks (RNNs), such as long short-term memories (LSTMs) or gated recurrent units (GRUs), take a vector as their input and use another vector to store their state.
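A minimal numpy sketch of the conventional GRU update described above, where both the input and the stored state are plain vectors (weight shapes and values here are illustrative, not from the paper):

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: vector input x, vector state h, vector output."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)                # update gate
    r = sigmoid(Wr @ x + Ur @ h)                # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
    return (1 - z) * h + z * h_tilde            # new state, again a vector

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                               # illustrative dimensions
x, h = rng.standard_normal(d_in), np.zeros(d_h)
W = lambda m, n: 0.1 * rng.standard_normal((m, n))
h_new = gru_step(x, h, W(d_h, d_in), W(d_h, d_h), W(d_h, d_in), W(d_h, d_h),
                 W(d_h, d_in), W(d_h, d_h))
```

The point of the sketch is that the state is a single fixed-size vector, which is the limitation such papers typically contrast with richer state representations.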
no code implementations • 5 Jun 2023 • Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen
As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms.
no code implementations • 31 May 2023 • Parthasaarathy Sudarsanam, Tuomas Virtanen
On the yes/no binary classification task, our proposed model achieves an accuracy of 68.3% compared to 62.7% in the reference model.
1 code implementation • 29 Apr 2023 • Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen
In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings.
no code implementations • 14 Mar 2023 • Wang Dai, Archontis Politis, Tuomas Virtanen
Specifically, each mask is used to multiply the corresponding channel's 2D representation, and the masked outputs of all channels are then summed.
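The mask-multiply-and-sum operation described above can be sketched in a few lines of numpy (array shapes and names are illustrative assumptions, not from the paper):

```python
import numpy as np

# Hypothetical shapes: C channels, T time frames, F frequency bins.
C, T, F = 4, 100, 64
rng = np.random.default_rng(0)

specs = rng.random((C, T, F))   # per-channel 2D time-frequency representations
masks = rng.random((C, T, F))   # one mask per channel, same shape

# Each mask multiplies its channel's 2D representation element-wise,
# and the masked outputs of all channels are summed into one 2D map.
combined = (masks * specs).sum(axis=0)   # shape (T, F)
```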
no code implementations • 8 Nov 2022 • Huang Xie, Okko Räsänen, Tuomas Virtanen
With a constant training setting on the retrieval system from [1], we study eight sampling strategies, including hard and semi-hard negative sampling.
no code implementations • 26 Oct 2022 • David Diaz-Guerra, Archontis Politis, Tuomas Virtanen
Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios.
no code implementations • 20 Sep 2022 • Huang Xie, Samuel Lipping, Tuomas Virtanen
Language-based audio retrieval is a task where natural language textual captions are used as queries to retrieve audio signals from a dataset.
1 code implementation • 4 Aug 2022 • Yanxiong Li, Wenchang Cao, Konstantinos Drossos, Tuomas Virtanen
Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost of caring for elderly people.
1 code implementation • 13 Jun 2022 • Huang Xie, Samuel Lipping, Tuomas Virtanen
Language-based audio retrieval is a task where natural language textual captions are used as queries to retrieve audio signals from a dataset.
no code implementations • 10 Jun 2022 • Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen
The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
no code implementations • 8 Jun 2022 • Irene Martín-Morató, Francesco Paissan, Alberto Ancilotto, Toni Heittola, Annamaria Mesaros, Elisabetta Farella, Alessio Brutti, Tuomas Virtanen
The provided baseline system is a convolutional neural network that employs post-training quantization of parameters, resulting in 46.5 K parameters and 29.23 million multiply-and-accumulate operations (MMACs).
2 code implementations • 4 Jun 2022 • Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen
Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on the differences from the baselines of the previous iterations; namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.
Ranked #1 on Sound Event Localization and Detection on STARSS22
no code implementations • 2 Jun 2022 • Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen
In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content.
no code implementations • 20 Apr 2022 • Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen
Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate the desired natural language answer.
2 code implementations • 29 Oct 2021 • Sharath Adavanne, Archontis Politis, Tuomas Virtanen
Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly formulated as either a classification or a regression problem.
1 code implementation • 6 Oct 2021 • Huang Xie, Okko Räsänen, Konstantinos Drossos, Tuomas Virtanen
We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip.
1 code implementation • 12 Jul 2021 • Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley
The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening.
no code implementations • 28 Jun 2021 • Pasi Pertilä, Emre Cakir, Aapo Hakala, Eemi Fagerlund, Tuomas Virtanen, Archontis Politis, Antti Eronen
Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants.
no code implementations • 22 Jun 2021 • Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen
Time-frequency masking or spectrum prediction computed via short symmetric windows is commonly used in low-latency deep neural network (DNN) based source separation.
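A minimal sketch of the short-window time-frequency masking setup mentioned above (window length and the all-ones mask are illustrative assumptions; a real system would predict the mask with a DNN):

```python
import numpy as np

def masked_frame(frame, mask, window):
    """Apply a time-frequency mask to one short, symmetrically windowed frame."""
    spec = np.fft.rfft(frame * window)              # short-window spectrum
    return np.fft.irfft(spec * mask, n=len(frame))  # masked frame back in time

n = 256                                  # short window -> low algorithmic latency
window = np.hanning(n)                   # symmetric analysis window
frame = np.random.default_rng(0).standard_normal(n)

# An all-ones mask passes the frame through unchanged (up to windowing).
out = masked_frame(frame, np.ones(n // 2 + 1), window)
```

The latency of such a system is bounded by the analysis window length, which is why short symmetric windows are preferred in low-latency separation.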
1 code implementation • 13 Jun 2021 • Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen
This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD).
1 code implementation • 28 May 2021 • Irene Martín-Morató, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen
The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy, and log loss under 0.8.
no code implementations • 28 May 2021 • Shanshan Wang, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen
More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams.
no code implementations • 25 Nov 2020 • Huang Xie, Okko Räsänen, Tuomas Virtanen
In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes.
no code implementations • 24 Nov 2020 • Huang Xie, Tuomas Virtanen
The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training.
1 code implementation • 27 Oct 2020 • Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra
In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism.
no code implementations • 22 Oct 2020 • Slobodan Djukanović, Yash Patel, Jiři Matas, Tuomas Virtanen
This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs).
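The two-stage (coarse-fine) regression idea can be illustrated on toy data: a first, simple model predicts a rough distance, and a second, more flexible model refines it by fitting the residual. Everything below (targets, model choices, polynomial degrees) is an illustrative assumption, not the paper's NN-based implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
d_true = 10.0 * x + 0.5 * np.sin(20 * x)   # toy "distance" target

# Stage 1 (coarse): fit a straight line to the target.
a, b = np.polyfit(x, d_true, 1)
d_coarse = a * x + b

# Stage 2 (fine): fit the residual left by the coarse stage.
resid = d_true - d_coarse
c = np.polyfit(x, resid, 9)                # more flexible refinement model
d_fine = d_coarse + np.polyval(c, x)

coarse_err = np.mean((d_true - d_coarse) ** 2)
fine_err = np.mean((d_true - d_fine) ** 2)
```

The refinement stage can only see what the coarse stage missed, so the combined prediction error is no worse than the coarse error alone.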
no code implementations • 22 Oct 2020 • Slobodan Djukanović, Jiři Matas, Tuomas Virtanen
The method is trained and tested on a traffic-monitoring dataset comprising 422 short, 20-second one-channel sound files with a total of 1421 vehicles passing by the microphone.
1 code implementation • 21 Oct 2020 • An Tran, Konstantinos Drossos, Tuomas Virtanen
Automated audio captioning (AAC) is a novel task where a method takes an audio sample as input and outputs a textual description (i.e., a caption) of its contents.
4 code implementations • 6 Sep 2020 • Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen
A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset.
no code implementations • 10 Jul 2020 • Konstantinos Drossos, Stylianos I. Mimilakis, Tuomas Virtanen
Sound event detection (SED) is the task of identifying sound events along with their onset and offset times.
1 code implementation • 9 Jul 2020 • Emre Çakır, Konstantinos Drossos, Tuomas Virtanen
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio.
1 code implementation • 6 Jul 2020 • Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen
In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.
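One simple way to realize the temporal sub-sampling of the audio input sequence described above is to average non-overlapping groups of frames (the averaging scheme and shapes here are illustrative; the paper applies sub-sampling inside the network):

```python
import numpy as np

def temporal_subsample(seq, factor):
    """Shorten a (time, features) sequence by averaging groups of frames."""
    t, f = seq.shape
    t_keep = (t // factor) * factor            # drop the ragged tail
    return seq[:t_keep].reshape(-1, factor, f).mean(axis=1)

audio_feats = np.random.default_rng(0).standard_normal((1000, 64))
short = temporal_subsample(audio_feats, 4)     # 1000 frames -> 250 frames
```

This narrows the length gap between the long audio feature sequence and the much shorter caption token sequence.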
no code implementations • 6 Jul 2020 • Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, Tuomas Virtanen
We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs).
2 code implementations • 15 Jun 2020 • Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra
Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features.
2 code implementations • 2 Jun 2020 • Archontis Politis, Sharath Adavanne, Tuomas Virtanen
This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge.
no code implementations • 29 May 2020 • Toni Heittola, Annamaria Mesaros, Tuomas Virtanen
This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge.
no code implementations • 12 Feb 2020 • Shuyang Zhao, Toni Heittola, Tuomas Virtanen
Training with recordings as context outperforms training with only annotated segments.
1 code implementation • 2 Feb 2020 • Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen
The number of channels of the CNNs and the size of the weight matrices of the RNNs have a direct effect on the total number of parameters of the SED method, which amounts to a couple of millions.
no code implementations • 1 Nov 2019 • Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti
The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.
7 code implementations • 21 Oct 2019 • Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen
Audio captioning is the novel task of general audio content description using free text.
1 code implementation • 22 Jul 2019 • Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen
In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practices followed for the creation of widely used image captioning and machine translation datasets.
1 code implementation • Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 2019 • Konstantinos Drossos, Shayan Gharib, Paul Magron, Tuomas Virtanen
On the contrary, with our method there is a decrease of 4% in F1 score and an increase of 7% in error rate (ER) for the TUT-SED Synthetic 2016 dataset.
3 code implementations • 21 May 2019 • Sharath Adavanne, Archontis Politis, Tuomas Virtanen
This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge.
no code implementations • 6 May 2019 • Huang Xie, Tuomas Virtanen
We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings.
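The zero-shot scheme above can be sketched as nearest-neighbor matching in the label embedding space. The random vectors below are stand-ins for Word2Vec embeddings, and `classify` assumes the audio has already been projected into that space; both are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for Word2Vec vectors of class labels (50-dim, values random).
label_emb = {"dog_bark": rng.standard_normal(50),
             "siren": rng.standard_normal(50),
             "rain": rng.standard_normal(50)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(projected_audio, label_emb):
    """Pick the class whose label embedding best matches the audio projection."""
    return max(label_emb, key=lambda c: cosine(projected_audio, label_emb[c]))

# A clip whose projection coincides with the 'siren' embedding maps to 'siren'.
pred = classify(label_emb["siren"], label_emb)
```

Because classes are represented by their label embeddings rather than learned output units, unseen classes can be added at test time by embedding their names.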
1 code implementation • 30 Apr 2019 • Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-Yiin Chang, Tara Sainath
Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing.
1 code implementation • 29 Apr 2019 • Sharath Adavanne, Archontis Politis, Tuomas Virtanen
This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN).
1 code implementation • 24 Apr 2019 • Konstantinos Drossos, Paul Magron, Tuomas Virtanen
A challenging problem in the field of deep learning-based machine listening is the degradation of performance when using data from unseen conditions.
1 code implementation • 17 Aug 2018 • Shayan Gharib, Konstantinos Drossos, Emre Çakır, Dmitriy Serdyuk, Tuomas Virtanen
A general problem in the acoustic scene classification task is the mismatched conditions between training and testing data, which significantly reduce the classification accuracy of the developed methods.
no code implementations • 2 Aug 2018 • Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen
In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content.
2 code implementations • 25 Jul 2018 • Annamaria Mesaros, Toni Heittola, Tuomas Virtanen
This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task.
8 code implementations • 30 Jun 2018 • Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space.
no code implementations • 9 May 2018 • Emre Çakır, Tuomas Virtanen
Sound event detection systems typically consist of two stages: extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier.
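The conventional two-stage pipeline described above can be sketched with a deliberately crude feature and classifier (per-frame log energy and a threshold, both illustrative assumptions; real systems use richer features and learned classifiers):

```python
import numpy as np

def extract_features(waveform, frame_len=256):
    """Hand-crafted stage: per-frame log energy as a stand-in feature."""
    t = (len(waveform) // frame_len) * frame_len
    frames = waveform[:t].reshape(-1, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-9)

def classify_frames(features, threshold):
    """Classifier stage: map features to frame-wise event activity."""
    return features > threshold

wav = np.random.default_rng(0).standard_normal(4096)
wav[1024:2048] *= 10.0                         # a loud "event" in the middle
feats = extract_features(wav)                  # 16 frames of 256 samples
activity = classify_frames(feats, threshold=np.median(feats) + 1.0)
```

End-to-end approaches, in contrast, learn the feature extraction stage jointly with the classifier directly from the raw waveform.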
2 code implementations • 1 Feb 2018 • Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio
Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods.
no code implementations • 29 Jan 2018 • Sharath Adavanne, Archontis Politis, Tuomas Virtanen
Each of these datasets has four-channel first-order Ambisonic, binaural, and single-channel versions, on which the performance of SED using the proposed method is compared to study the potential of SED with multichannel audio.
no code implementations • 4 Nov 2017 • Stylianos Ioannis Mimilakis, Konstantinos Drossos, João F. Santos, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio
Singing voice separation based on deep learning relies on the usage of time-frequency masking.
no code implementations • 27 Oct 2017 • Sharath Adavanne, Archontis Politis, Tuomas Virtanen
This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources.
no code implementations • 30 Jun 2017 • Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen
The encoder is a multi-layered, bi-directional gated recurrent unit (GRU), and the decoder is a multi-layered GRU with a classification layer connected to its last GRU.
no code implementations • 7 Jun 2017 • Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen
This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks.
no code implementations • 7 Jun 2017 • Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, Tuomas Virtanen
In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for automatic sound event detection (SED) task.
no code implementations • 7 Jun 2017 • Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina
This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space.
no code implementations • 7 Jun 2017 • Sharath Adavanne, Pasi Pertilä, Tuomas Virtanen
This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection.
no code implementations • 7 Mar 2017 • Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen
Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions.
1 code implementation • 21 Feb 2017 • Emre Çakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, Tuomas Virtanen
Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure.
2 code implementations • 4 Apr 2016 • Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs).