1 code implementation • 27 Mar 2024 • Xilin Jiang, Cong Han, Nima Mesgarani
In this work, we replace transformers with Mamba, a selective state space model, for speech separation.
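As a rough illustration of the selective state-space recurrence at the core of Mamba, here is a minimal single-channel numpy sketch; the scalar step-size projection and parameter shapes are illustrative simplifications, not the paper's actual architecture.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt):
    """Minimal selective state-space scan (Mamba-style), single channel.

    x:    (T,) input sequence
    A:    (N,) negative state decay rates (diagonal state matrix)
    W_B, W_C, W_dt: projections making B, C, and the step size
                    input-dependent ("selective").
    """
    T, N = x.shape[0], A.shape[0]
    h = np.zeros(N)                          # hidden state
    y = np.empty(T)
    for t in range(T):
        dt = np.log1p(np.exp(W_dt * x[t]))   # softplus: positive step size
        Ad = np.exp(dt * A)                  # discretized decay, per state
        B = W_B * x[t]                       # input-dependent input matrix
        C = W_C * x[t]                       # input-dependent output matrix
        h = Ad * h + dt * B * x[t]           # state update
        y[t] = C @ h                         # readout
    return y

# toy usage
rng = np.random.default_rng(0)
N = 8
y = selective_ssm(rng.standard_normal(128), -np.abs(rng.standard_normal(N)),
                  rng.standard_normal(N), rng.standard_normal(N), 0.5)
```

Unlike attention, the scan above is linear in sequence length, which is the main efficiency argument for replacing transformers in long-sequence speech tasks.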
no code implementations • 6 Feb 2024 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume.
no code implementations • 31 Jan 2024 • Gavin Mischler, Yinghao Aaron Li, Stephan Bickel, Ashesh D. Mehta, Nima Mesgarani
We also compare the feature extraction pathways of the LLMs to each other and identify new ways in which high-performing models have converged toward similar hierarchical processing mechanisms.
no code implementations • 27 Sep 2023 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode 'what' and 'where' of spatial audio.
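The "SimCLR" part refers to contrastive learning over augmented views; below is a minimal numpy sketch of the standard NT-Xent objective such frameworks optimize. The batch pairing and temperature are generic assumptions, not details taken from the paper.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """SimCLR-style NT-Xent loss for a batch of paired embeddings.

    z1, z2: (B, D) embeddings of two augmented views of the same clips;
    row i of z1 and row i of z2 form the positive pair.
    """
    z = np.concatenate([z1, z2])                       # (2B, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> cosine sim
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    B = z1.shape[0]
    pos = np.r_[np.arange(B, 2 * B), np.arange(B)]     # index of each positive
    m = sim.max(axis=1, keepdims=True)                 # stabilize logsumexp
    logprob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -logprob[np.arange(2 * B), pos].mean()
```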
no code implementations • 18 Sep 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance.
no code implementations • 18 Jul 2023 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement.
1 code implementation • NeurIPS 2023 • Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
no code implementations • 29 May 2023 • Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani
Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time.
1 code implementation • 4 Apr 2023 • Gavin Mischler, Vinay Raghavan, Menoua Keshishian, Nima Mesgarani
Recently, the computational neuroscience community has pushed for more transparent and reproducible methods across the field.
no code implementations • 13 Mar 2023 • Cong Han, Nima Mesgarani
Binaural speech separation in real-world scenarios often involves moving speakers.
no code implementations • 11 Feb 2023 • Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani
Auditory attention decoding (AAD) is a technique used to identify and amplify the talker that a listener is focused on in a noisy environment.
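A common AAD baseline (not necessarily this paper's method) reconstructs the speech envelope from the neural recordings with a pre-trained linear decoder and selects the talker whose envelope correlates best with the reconstruction; a minimal sketch, with hypothetical shapes:

```python
import numpy as np

def decode_attention(neural, env_a, env_b, decoder):
    """Pick the attended talker by envelope correlation.

    neural:  (T, C) neural recordings (e.g., EEG channels)
    env_a, env_b: (T,) speech envelopes of the two talkers
    decoder: (C,) linear stimulus-reconstruction weights (pre-trained)
    """
    recon = neural @ decoder                 # reconstructed envelope, (T,)
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return "A" if corr(recon, env_a) > corr(recon, env_b) else "B"
```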
2 code implementations • 20 Jan 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns.
1 code implementation • 29 Dec 2022 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models.
1 code implementation • 30 May 2022 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging.
1 code implementation • NeurIPS 2021 • Menoua Keshishian, Samuel Norman-Haignere, Nima Mesgarani
We show that training causes these integration windows to shrink at early layers and expand at higher layers, creating a hierarchy of integration windows across the network.
2 code implementations • 21 Jul 2021 • Yinghao Aaron Li, Ali Zare, Nima Mesgarani
We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
no code implementations • 17 Dec 2020 • Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen
Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years.
1 code implementation • 14 Dec 2020 • Yi Luo, Cong Han, Nima Mesgarani
A context codec module, consisting of a context encoder and a context decoder, acts as a learnable downsampling and upsampling pair that shortens the sequential features processed by the separation module.
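A minimal PyTorch sketch of such a learnable downsample/upsample pair, with illustrative layer choices and sizes (the paper's actual context codec may differ):

```python
import torch
import torch.nn as nn

class ContextCodec(nn.Module):
    """Learnable downsample/upsample pair around a separation module.

    Compresses every `k` frames into one summary frame before separation,
    then restores the original length afterwards.
    """
    def __init__(self, dim=128, k=4):
        super().__init__()
        self.encode = nn.Conv1d(dim, dim, kernel_size=k, stride=k)
        self.decode = nn.ConvTranspose1d(dim, dim, kernel_size=k, stride=k)

    def forward(self, x, separator):    # x: (B, dim, T), T divisible by k
        short = self.encode(x)          # shorter sequence for the separator
        short = separator(short)        # any sequence model
        return self.decode(short)       # restore original resolution

# toy usage: identity separator on a 100-frame feature
codec = ContextCodec()
out = codec(torch.randn(1, 128, 100), separator=lambda s: s)
print(out.shape)  # torch.Size([1, 128, 100])
```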
no code implementations • 27 Mar 2020 • Yi Luo, Nima Mesgarani
Many recent source separation systems are designed to separate a fixed number of sources out of a mixture.
2 code implementations • 30 Oct 2019 • Yi Luo, Zhuo Chen, Nima Mesgarani, Takuya Yoshioka
An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones.
1 code implementation • 29 Sep 2019 • Yi Luo, Enea Ceolini, Cong Han, Shih-Chii Liu, Nima Mesgarani
Beamforming has been extensively investigated for multi-channel audio processing tasks.
16 code implementations • 20 Sep 2018 • Yi Luo, Nima Mesgarani
Most previous methods formulate the separation problem in the time-frequency representation of the mixed signal, which has several drawbacks: the phase and magnitude of the signal are decoupled, the time-frequency representation is suboptimal for speech separation, and computing the spectrograms introduces long latency.
Ranked #2 on Multi-task Audio Source Separation on MTASS
Multi-task Audio Source Separation, Music Source Separation, +3
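The TasNet family instead learns an encoder and decoder directly on the waveform and separates by masking the encoded representation. A minimal PyTorch sketch of that pattern, with an illustrative one-layer mask estimator standing in for the paper's temporal convolutional network:

```python
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    """Encoder -> mask estimation -> decoder, all in the time domain."""
    def __init__(self, n_src=2, n_filters=512, win=16):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, win, stride=win // 2, bias=False)
        self.masker = nn.Sequential(   # stand-in for the TCN separator
            nn.Conv1d(n_filters, n_src * n_filters, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, win, stride=win // 2,
                                          bias=False)

    def forward(self, mix):                      # mix: (B, 1, T)
        rep = torch.relu(self.encoder(mix))      # nonnegative representation
        masks = self.masker(rep).chunk(self.n_src, dim=1)
        return [self.decoder(m * rep) for m in masks]  # one waveform per source

model = TimeDomainSeparator()
srcs = model(torch.randn(1, 1, 16000))           # ~1 s at 16 kHz
print(len(srcs), srcs[0].shape)
```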
1 code implementation • ISCA Interspeech 2018 • Yi Luo, Nima Mesgarani
We investigate the recently proposed Time-domain Audio Separation Network (TasNet) in the task of real-time single-channel speech dereverberation.
Ranked #28 on Speech Separation on WSJ0-2mix
3 code implementations • 1 Nov 2017 • Yi Luo, Nima Mesgarani
We directly model the signal in the time-domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs.
Ranked #30 on Speech Separation on WSJ0-2mix
1 code implementation • 26 Oct 2017 • Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos.
no code implementations • ICML 2017 • Tasha Nagamine, Nima Mesgarani
Despite the recent success of deep learning, the nature of the transformations that deep neural networks apply to the input features remains poorly understood.
no code implementations • 12 Jul 2017 • Yi Luo, Zhuo Chen, Nima Mesgarani
A reference point (attractor) is created in the embedding space to represent each speaker, defined as the centroid of that speaker's time-frequency embeddings.
1 code implementation • 27 Nov 2016 • Zhuo Chen, Yi Luo, Nima Mesgarani
We propose a novel deep learning framework for single channel speech separation by creating attractor points in high dimensional embedding space of the acoustic signals which pull together the time-frequency bins corresponding to each source.
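A minimal numpy sketch of the attractor mechanism, using oracle assignments to form the attractors as is done at training time; shapes are illustrative:

```python
import numpy as np

def danet_masks(V, Y):
    """Deep-attractor-style masks from T-F embeddings.

    V: (TF, K) embedding of each time-frequency bin
    Y: (TF, S) one-hot source assignments (oracle, at training time)
    """
    # attractor = centroid of each source's embeddings
    A = (Y.T @ V) / (Y.sum(axis=0)[:, None] + 1e-8)       # (S, K)
    logits = V @ A.T                                      # similarity to attractors
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # (TF, S) soft masks
```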
no code implementations • 18 Nov 2016 • Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks.
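Deep clustering trains embeddings so that embedding affinities match the ideal source-assignment affinities; a minimal numpy sketch of that objective in its standard low-rank form (generic, not this paper's exact formulation):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 without forming the (TF x TF) affinity matrices.

    V: (TF, K) unit-norm embeddings; Y: (TF, S) one-hot source labels.
    Expanding the Frobenius norm leaves three small Gram matrices.
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```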