no code implementations • 6 Feb 2024 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume.
no code implementations • 31 Jan 2024 • Gavin Mischler, Yinghao Aaron Li, Stephan Bickel, Ashesh D. Mehta, Nima Mesgarani
We also compare the feature extraction pathways of the LLMs to each other and identify new ways in which high-performing models have converged toward similar hierarchical processing mechanisms.
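One common way to compare feature extraction pathways across models is to measure layer-to-layer representational similarity on a shared stimulus set. The sketch below is a minimal illustration of that idea using linear CKA on randomly generated activations; the metric, shapes, and model names are assumptions for illustration, not the paper's actual analysis pipeline.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices of shape
    (n_samples, n_features); higher means more similar representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

# Hypothetical layer-wise activations from two language models on the same stimuli.
rng = np.random.default_rng(0)
model_a_layers = [rng.standard_normal((200, 768)) for _ in range(12)]
model_b_layers = [rng.standard_normal((200, 1024)) for _ in range(24)]

# Similarity matrix: how each layer of model A aligns with each layer of model B.
sim = np.array([[linear_cka(a, b) for b in model_b_layers] for a in model_a_layers])
print(sim.shape)  # (12, 24)
```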
no code implementations • 27 Sep 2023 • Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode the 'what' and 'where' of spatial audio.
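As a rough illustration of the SimCLR-style training objective such a framework builds on, the sketch below computes an NT-Xent contrastive loss between embeddings of two augmented views of the same recordings. The encoder, augmentations, and embedding sizes are placeholders; the actual multi-channel augmentations and projection head of MC-SimCLR are not reproduced here.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR NT-Xent loss: each embedding in z1 should match its counterpart
    in z2 against all other samples in the batch (and vice versa)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                     # (2B, D)
    sim = z @ z.t() / temperature                      # cosine similarities
    sim.fill_diagonal_(float('-inf'))                  # exclude self-pairs
    batch = z1.size(0)
    targets = torch.cat([torch.arange(batch) + batch,  # z1[i] matches z2[i]
                         torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Hypothetical embeddings of two augmented "views" of the same
# multi-channel recordings (e.g., different crops or channel perturbations).
z_view1 = torch.randn(16, 128)
z_view2 = torch.randn(16, 128)
print(float(nt_xent_loss(z_view1, z_view2)))
```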
no code implementations • 18 Sep 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance.
no code implementations • 18 Jul 2023 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement.
1 code implementation • NeurIPS 2023 • Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
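One distinctive ingredient named here is adversarial training against a large speech language model. The sketch below shows how a small discriminator head on top of a frozen SLM-style feature extractor could provide such a signal, using LSGAN-style losses; both networks are toy stand-ins, and the style diffusion component of StyleTTS 2 is not shown.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen SLM feature extractor (e.g., a WavLM-like model):
# maps a waveform batch (B, T) to frame-level features (B, T', D).
class FrozenSLM(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=320, stride=160)
    def forward(self, wav):
        with torch.no_grad():
            return self.conv(wav.unsqueeze(1)).transpose(1, 2)

# Small trainable discriminator head operating in SLM feature space.
class SLMDiscriminator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
    def forward(self, feats):
        return self.net(feats).mean(dim=1)  # one realness score per utterance

slm, disc = FrozenSLM(), SLMDiscriminator()
real_wav, fake_wav = torch.randn(4, 16000), torch.randn(4, 16000)  # placeholder audio

# LSGAN-style objectives: the discriminator separates real from synthesized
# speech in SLM feature space; the TTS generator is trained to fool it.
d_loss = ((disc(slm(real_wav)) - 1) ** 2).mean() + (disc(slm(fake_wav)) ** 2).mean()
g_loss = ((disc(slm(fake_wav)) - 1) ** 2).mean()
print(float(d_loss), float(g_loss))
```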
no code implementations • 29 May 2023 • Xilin Jiang, Yinghao Aaron Li, Nima Mesgarani
Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time.
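To make the incremental-learning setting concrete, the sketch below trains a toy classifier on successive tasks that introduce new sound classes while rehearsing examples stored in a small replay buffer, one common way to reduce forgetting. The replay strategy, model, and data are illustrative assumptions, not the method proposed in the paper.

```python
import random
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 20))  # 20 total classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
replay_buffer = []  # (feature, label) pairs kept from earlier tasks

def train_task(task_data, replay_size=32):
    """Train on a new sound-class task while rehearsing stored old examples."""
    for features, labels in task_data:
        # remember a few new examples for future rehearsal
        for x, y in zip(features, labels):
            replay_buffer.append((x.detach(), int(y)))
        # mix in examples from previously learned classes
        rehearsal = random.sample(replay_buffer, min(replay_size, len(replay_buffer)))
        batch_x = torch.cat([features, torch.stack([x for x, _ in rehearsal])])
        batch_y = torch.cat([labels, torch.tensor([y for _, y in rehearsal])])
        optimizer.zero_grad()
        loss_fn(model(batch_x), batch_y).backward()
        optimizer.step()

# Hypothetical stream of two tasks, each introducing new sound classes.
task1 = [(torch.randn(8, 64), torch.randint(0, 10, (8,))) for _ in range(5)]
task2 = [(torch.randn(8, 64), torch.randint(10, 20, (8,))) for _ in range(5)]
train_task(task1)
train_task(task2)
```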
no code implementations • 11 Feb 2023 • Cong Han, Vishal Choudhari, Yinghao Aaron Li, Nima Mesgarani
Auditory attention decoding (AAD) is a technique used to identify and amplify the talker that a listener is focused on in a noisy environment.
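A widely used decision rule in AAD is to reconstruct the attended speech envelope from neural recordings and correlate it with the envelopes of the candidate talkers. The sketch below illustrates that selection-and-amplification step with synthetic signals; it is a generic stimulus-reconstruction baseline, not the decoder developed in this work.

```python
import numpy as np

def attended_talker(decoded_envelope, talker_envelopes):
    """Pick the talker whose envelope correlates best with the envelope
    reconstructed from neural (e.g., EEG/iEEG) recordings."""
    corrs = [np.corrcoef(decoded_envelope, env)[0, 1] for env in talker_envelopes]
    return int(np.argmax(corrs)), corrs

# Hypothetical envelopes (e.g., 64 Hz, 10 s) of two separated talkers, plus an
# envelope decoded from the listener's neural data while attending talker A.
rng = np.random.default_rng(1)
talker_a, talker_b = rng.random(640), rng.random(640)
decoded = 0.7 * talker_a + 0.3 * rng.random(640)

idx, corrs = attended_talker(decoded, [talker_a, talker_b])
print(idx, [round(c, 2) for c in corrs])  # expected: 0 (talker A)

# The attended talker can then be amplified relative to the others, e.g.:
gains = np.where(np.arange(2) == idx, 1.0, 0.3)  # boost attended, attenuate rest
```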
2 code implementations • 20 Jan 2023 • Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns.
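One way such language-model knowledge can reach a TTS system is as extra conditioning for the prosody predictor. The sketch below concatenates phoneme embeddings with aligned hidden states from a frozen pre-trained encoder before predicting duration and pitch; the module names, dimensions, and BERT-style encoder are placeholders rather than the actual components used in the paper.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Predicts per-phoneme duration and pitch from phoneme embeddings
    concatenated with contextual features from a pre-trained LM."""
    def __init__(self, n_phonemes=100, phoneme_dim=128, lm_dim=768):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, phoneme_dim)
        self.rnn = nn.GRU(phoneme_dim + lm_dim, 256, batch_first=True, bidirectional=True)
        self.duration_head = nn.Linear(512, 1)
        self.pitch_head = nn.Linear(512, 1)

    def forward(self, phoneme_ids, lm_features):
        x = torch.cat([self.phoneme_emb(phoneme_ids), lm_features], dim=-1)
        h, _ = self.rnn(x)
        return self.duration_head(h).squeeze(-1), self.pitch_head(h).squeeze(-1)

# Hypothetical inputs: 12 phonemes per utterance, with aligned hidden states
# from a frozen phoneme-level pre-trained LM (a PL-BERT-style stand-in).
phonemes = torch.randint(0, 100, (2, 12))
lm_feats = torch.randn(2, 12, 768)
durations, pitch = ProsodyPredictor()(phonemes, lm_feats)
print(durations.shape, pitch.shape)  # torch.Size([2, 12]) each
```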
1 code implementation • 29 Dec 2022 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models.
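The transfer-learning idea can be pictured as reusing a style encoder trained inside a TTS model as a frozen extractor of speaker/style information, while a separate pathway handles linguistic content. The sketch below illustrates that split with toy modules; the architectures and the way the encoders are combined are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

# Placeholder for a style encoder pre-trained as part of a style-based TTS model.
# In the transfer-learning setting it would be loaded from a checkpoint and frozen.
pretrained_style_encoder = nn.Sequential(
    nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, 64),
)
for p in pretrained_style_encoder.parameters():
    p.requires_grad = False  # keep the learned style space fixed

# Trainable content pathway: should capture linguistic content, not speaker style.
content_encoder = nn.Sequential(nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.ReLU())

mel = torch.randn(4, 80, 200)           # batch of mel-spectrograms
style = pretrained_style_encoder(mel)   # (4, 64) utterance-level style/speaker vector
content = content_encoder(mel)          # (4, 128, 200) frame-level content features
print(style.shape, content.shape)
```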
1 code implementation • 30 May 2022 • Yinghao Aaron Li, Cong Han, Nima Mesgarani
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging.
2 code implementations • 21 Jul 2021 • Yinghao Aaron Li, Ali Zare, Nima Mesgarani
We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2.
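To give a feel for the StarGAN v2-style conversion step, the sketch below passes source mel features through a generator conditioned on a style vector produced by a per-speaker style encoder from a target-speaker reference. The networks are toy placeholders and the adversarial, cycle, and other training losses are omitted; this is an illustration of the data flow, not the released StarGANv2-VC implementation.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a reference utterance to a style vector for its speaker (domain)."""
    def __init__(self, n_speakers=10, style_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv1d(80, 128, 5, padding=2), nn.ReLU(),
                                      nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.heads = nn.ModuleList([nn.Linear(128, style_dim) for _ in range(n_speakers)])

    def forward(self, mel, speaker_ids):
        h = self.backbone(mel)
        return torch.stack([self.heads[s](h[i]) for i, s in enumerate(speaker_ids.tolist())])

class Generator(nn.Module):
    """Converts source mel features toward the target style vector."""
    def __init__(self, style_dim=64):
        super().__init__()
        self.proj = nn.Linear(style_dim, 80)
        self.net = nn.Conv1d(80, 80, 5, padding=2)

    def forward(self, mel, style):
        return self.net(mel + self.proj(style).unsqueeze(-1))  # style-conditioned transform

gen, style_enc = Generator(), StyleEncoder()
source_mel = torch.randn(2, 80, 120)   # utterances from source speakers
target_ref = torch.randn(2, 80, 120)   # reference utterances from target speakers
target_ids = torch.tensor([3, 7])      # target speaker (domain) indices
converted = gen(source_mel, style_enc(target_ref, target_ids))
print(converted.shape)  # torch.Size([2, 80, 120])
```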