1 code implementation • 2 Jan 2024 • Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a text-to-audio (TTA) system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment.
Ranked #5 on Audio Generation on AudioCaps
1 code implementation • 27 Dec 2023 • Qifei Li, Yingming Gao, Cong Wang, Yayue Deng, Jinlong Xue, Yichen Han, Ya Li
To address this problem, we propose a frame-level emotional state alignment method for speech emotion recognition (SER).
no code implementations • 16 Dec 2023 • Yayue Deng, Jinlong Xue, Yukang Jia, Qifei Li, Yichen Han, Fengping Wang, Yingming Gao, Dengfeng Ke, Ya Li
In this paper, we introduce a contrastive learning-based CSS framework, CONCSS.
no code implementations • 5 Jun 2023 • Dengfeng Ke, Yayue Deng, Yukang Jia, Jinlong Xue, Qi Luo, Ya Li, Jianqing Sun, Jiaen Liang, Binghuai Lin
Regressive Text-to-Speech (TTS) systems use an attention mechanism to generate the alignment between the text and the acoustic feature sequence.
no code implementations • 3 May 2023 • Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, JianHua Tao, Jianqing Sun, Jiaen Liang
However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphasis.
no code implementations • 7 Oct 2022 • Yichen Han, Ya Li, Yingming Gao, Jinlong Xue, Songpo Wang, Lei Yang
Then we used keypoint decomposition to extract the parameters that control video synthesis from the backend output and the source image.
1 code implementation • 20 Mar 2022 • Jinlong Xue, Yayue Deng, Yichen Han, Ya Li, Jianqing Sun, Jiaen Liang
In recent years, neural-network-based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress.