1 code implementation • 15 Nov 2023 • Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, Yang Zhang
Uncertainty decomposition refers to the task of decomposing the total uncertainty of a model into data (aleatoric) uncertainty, resulting from the inherent complexity or ambiguity of the data, and model (epistemic) uncertainty, resulting from the lack of knowledge in the model.
no code implementations • 23 Jun 2023 • Zhongzhi Yu, Yang Zhang, Kaizhi Qian, Yonggan Fu, Yingyan Lin
Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader applications: (1) The difficulty of introducing scalability into the model to support more languages with limited training, inference, and storage overhead; (2) The low-resource adaptation ability that enables effective low-resource adaptation while avoiding over-fitting and catastrophic forgetting issues.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • CVPR 2023 • Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound.
1 code implementation • 2 Nov 2022 • Yonggan Fu, Yang Zhang, Kaizhi Qian, Zhifan Ye, Zhongzhi Yu, Cheng-I Lai, Yingyan Lin
We believe S$^3$-Router has provided a new perspective for practical deployment of speech SSL models.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
1 code implementation • 20 Apr 2022 • Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks.
1 code implementation • 29 Mar 2022 • Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson
We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline.
1 code implementation • 29 Mar 2022 • Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson
An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
1 code implementation • 26 Mar 2022 • Chak Ho Chan, Kaizhi Qian, Yang Zhang, Mark Hasegawa-Johnson
SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner.
no code implementations • 4 Oct 2021 • Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
Are end-to-end text-to-speech (TTS) models over-parametrized?
1 code implementation • 16 Jun 2021 • Kaizhi Qian, Yang Zhang, Shiyu Chang, JinJun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson
In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.
no code implementations • NeurIPS 2021 • Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass
We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
1 code implementation • 21 Nov 2020 • Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott
Contemporary speech enhancement predominantly relies on audio transforms that are trained to reconstruct a clean speech waveform.
6 code implementations • ICML 2020 • Kaizhi Qian, Yang Zhang, Shiyu Chang, David Cox, Mark Hasegawa-Johnson
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
1 code implementation • 15 Apr 2020 • Kaizhi Qian, Zeyu Jin, Mark Hasegawa-Johnson, Gautham J. Mysore
Recently, AutoVC, a conditional autoencoders (CAEs) based method achieved state-of-the-art results by disentangling the speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
no code implementations • ICLR 2019 • Yang Zhang, Shiyu Chang, Mo Yu, Kaizhi Qian
The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause mis-classification, also known as the margin of an input feature.
no code implementations • 25 Sep 2019 • Hui Shi, Yang Zhang, Hao Wu, Shiyu Chang, Kaizhi Qian, Mark Hasegawa-Johnson, Jishen Zhao
Convolutional neural network (CNN) for time series data implicitly assumes that the data are uniformly sampled, whereas many event-based and multi-modal data are nonuniform or have heterogeneous sampling rates.
11 code implementations • 14 May 2019 • Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson
On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.
no code implementations • 15 Feb 2018 • Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florencio, Mark Hasegawa-Johnson
On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with variable number of input channels.