no code implementations • 25 Dec 2023 • Aditya Ravuri, Erica Cooper, Junichi Yamagishi
Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially since traditional measures such as the Mean Opinion Score (MOS) are cumbersome to collect at scale.
1 code implementation • 8 Oct 2023 • Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
That is, the partial rank similarity (PRS) is measured rather than the individual MOS values, as with the L1 loss.
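As a rough illustration of rank-based training versus direct L1 regression, here is a minimal PyTorch sketch; the pairwise margin-ranking loss below is an illustrative stand-in for the paper's PRS objective, not its exact formulation, and the margin value is an assumption.

```python
import torch

def l1_loss(pred, mos):
    # Direct regression: penalize deviation from each individual MOS value.
    return torch.abs(pred - mos).mean()

def pairwise_rank_loss(pred, mos, margin=0.1):
    # Rank-based alternative (illustrative, not the paper's exact PRS loss):
    # penalize pairs of utterances whose predicted ordering disagrees
    # with the ordering of their ground-truth MOS values.
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)  # pred[j] - pred[i]
    diff_true = mos.unsqueeze(0) - mos.unsqueeze(1)    # mos[j] - mos[i]
    sign = torch.sign(diff_true)                       # desired ordering
    loss = torch.relu(margin - sign * diff_pred)       # hinge on violations
    return loss[sign != 0].mean()                      # ignore tied pairs
```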
no code implementations • 4 Oct 2023 • Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.
1 code implementation • 28 May 2023 • Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi
To properly measure misclassified ranges and better evaluate spoof localization performance, we upgrade point-based EER to range-based EER.
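For context, a minimal NumPy sketch of the conventional point-based EER that the paper generalizes; the range-based variant described above instead scores misclassified time ranges, which this sketch does not implement.

```python
import numpy as np

def point_based_eer(bona_scores, spoof_scores):
    # Assumes higher scores indicate bona fide speech. Sweep a decision
    # threshold over all observed scores and return the operating point
    # where false-acceptance and false-rejection rates are (roughly) equal.
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```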
1 code implementation • Interspeech 2023 • Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi
The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge.
no code implementations • 17 May 2023 • Erica Cooper, Junichi Yamagishi
Mean Opinion Score (MOS) is a popular measure for evaluating synthesized speech.
no code implementations • 1 Sep 2022 • Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi
Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring.
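A minimal sketch of the back-end scoring step in such a pipeline, assuming embeddings have already been extracted by a front-end such as a TDNN; cosine scoring is shown here because PLDA scoring requires trained model parameters.

```python
import numpy as np

def cosine_backend(enroll_emb, test_emb):
    # Back-end similarity scoring between two speaker embeddings
    # (e.g., x-vectors from a TDNN front-end). Higher = more similar.
    e = enroll_emb / np.linalg.norm(enroll_emb)
    t = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(e, t))
```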
no code implementations • 11 Apr 2022 • Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi
Since the short spoofed speech segments to be embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as long as 640 ms. Third, we propose a new CM that enables the simultaneous use of segment-level labels at different temporal resolutions as well as utterance-level labels, performing utterance- and segment-level detection at the same time.
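As a sketch of how segment-level labels at several temporal resolutions might be derived, assuming a fine-grained per-frame spoof mask is available; the 10 ms frame step and the any-frame-spoofed labeling rule are assumptions for illustration, not the paper's specification.

```python
import numpy as np

RESOLUTIONS_MS = [20, 40, 80, 160, 320, 640]  # six temporal resolutions

def segment_labels(frame_spoof_mask, frame_ms=10):
    # Derive segment-level spoof labels at several temporal resolutions
    # from a per-frame mask (1 = spoofed, 0 = bona fide).
    labels = {}
    for res in RESOLUTIONS_MS:
        frames_per_seg = res // frame_ms
        n_seg = len(frame_spoof_mask) // frames_per_seg
        segs = frame_spoof_mask[:n_seg * frames_per_seg].reshape(
            n_seg, frames_per_seg)
        # A segment is labeled spoofed if any frame inside it is spoofed.
        labels[res] = segs.max(axis=1)
    return labels
```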
1 code implementation • 18 Oct 2021 • Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda
An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores.
1 code implementation • 6 Oct 2021 • Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi
Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test.
no code implementations • 4 Oct 2021 • Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass
Are end-to-end text-to-speech (TTS) models over-parametrized?
1 code implementation • 24 Jul 2021 • Xuan Shi, Erica Cooper, Junichi Yamagishi
Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer.
1 code implementation • 4 May 2021 • Jennifer Williams, Jason Fong, Erica Cooper, Junichi Yamagishi
This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data.
no code implementations • 6 Apr 2021 • Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans
By definition, partially-spoofed utterances contain a mix of both spoofed and bona fide segments, which will likely degrade the performance of countermeasures trained with entirely spoofed utterances.
1 code implementation • 4 Apr 2021 • Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi
Probabilistic linear discriminant analysis (PLDA) or cosine similarity has been widely used in traditional speaker verification systems as a back-end technique to measure pairwise similarities.
Ranked #1 on Speaker Verification on CN-CELEB
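A minimal sketch of the fixed aggregation baseline for multiple enrollment utterances: average the enrollment embeddings and score with cosine similarity. The paper's attention back-end replaces this fixed average with a learned weighting; that model is not reproduced here.

```python
import numpy as np

def multi_enroll_cosine(enroll_embs, test_emb):
    # Fixed aggregation: average multiple enrollment embeddings into a
    # single centroid, then score with cosine similarity. An attention
    # back-end would instead learn how to weight the enrollments.
    centroid = np.mean(enroll_embs, axis=0)
    centroid /= np.linalg.norm(centroid)
    test = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(centroid, test))
```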
no code implementations • 10 Nov 2020 • Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi
We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis.
no code implementations • 21 Oct 2020 • Antoine Perquin, Erica Cooper, Junichi Yamagishi
Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and control of the pronunciation in synthetic speech.
1 code implementation • 21 Oct 2020 • Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi
Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than from a self-supervised VQ-VAE with global conditions.
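A minimal sketch of the vector-quantization step that produces such codebook indices, assuming frame-level encoder outputs z of shape (T, D) and a learned codebook of shape (K, D):

```python
import torch

def vector_quantize(z, codebook):
    # Map each frame embedding to its nearest codebook entry; the
    # resulting discrete indices are what downstream phone recognition
    # reads from.
    dists = torch.cdist(z, codebook)   # (T, K) pairwise L2 distances
    indices = dists.argmin(dim=1)      # (T,) nearest codebook index
    return codebook[indices], indices  # quantized frames and indices
```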
1 code implementation • 4 May 2020 • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi
This is followed by an analysis of synthesis quality, speaker and dialect similarity, and a remark on the effectiveness of our speaker augmentation approach.
3 code implementations • 23 Oct 2019 • Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi
While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers.
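A minimal sketch of embedding-based speaker conditioning, assuming a PyTorch encoder output of shape (B, T, D_enc) and an externally computed speaker embedding of shape (B, D_spk); concatenation is one common conditioning choice, not necessarily the exact mechanism used in the paper.

```python
import torch

def condition_on_speaker(encoder_out, spk_emb):
    # Zero-shot conditioning: broadcast an externally computed speaker
    # embedding (e.g., from a speaker-verification model) across the
    # encoder timesteps and concatenate it to each frame.
    T = encoder_out.size(1)
    spk = spk_emb.unsqueeze(1).expand(-1, T, -1)  # (B, T, D_spk)
    return torch.cat([encoder_out, spk], dim=-1)  # (B, T, D_enc + D_spk)
```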