no code implementations • 16 May 2020 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen
To address this problem and improve the inference speed, we propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate the convergence.
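The core idea — using CTC spikes to predict the target length before non-autoregressive decoding — can be sketched as follows. This is an illustrative toy, assuming frame-level argmax labels from a CTC output; the function name and blank index are assumptions, not the paper's implementation.

```python
# Illustrative sketch: estimate the target-sequence length from frame-level
# CTC predictions by collapsing consecutive repeats and dropping blanks.
# The number of surviving "spikes" serves as the predicted output length.
BLANK = 0  # assumed blank label index

def predict_length(frame_labels):
    """Count CTC spikes: collapse consecutive repeats, drop blanks."""
    spikes = []
    prev = None
    for lab in frame_labels:
        if lab != BLANK and lab != prev:
            spikes.append(lab)
        prev = lab
    return len(spikes), spikes

# e.g. frame-wise argmax over a CTC posterior:
length, spikes = predict_length([0, 3, 3, 0, 0, 5, 5, 5, 0, 7, 0])
```

Here the three spikes (labels 3, 5, 7) would trigger three decoder positions at once, instead of one autoregressive step per token.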
no code implementations • 11 May 2020 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Without beam search, the one-pass propagation greatly reduces the inference time cost of LASO.
no code implementations • 6 Apr 2020 • Cunhang Fan, Jian-Hua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen
In this paper, we propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on deep clustering (DC).
no code implementations • 1 Apr 2020 • Jiangyan Yi, Jian-Hua Tao, Ye Bai, Zhengkun Tian, Cunhang Fan
The other is that POS tags are provided by an external POS tagger.
no code implementations • 17 Mar 2020 • Cunhang Fan, Jian-Hua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen, Xuefei Liu
Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech.
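A hypothetical sketch of such an attention-fusion step, assuming frame-aligned feature matrices for the mixture and the pre-separated speech; the dot-product similarity, sigmoid gating, and concatenation here are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

# Sketch of attention fusion: weight pre-separated frames by their
# similarity to the mixture, then concatenate as deep fusion features.
def attention_fusion(mixture, pre_sep):
    """mixture: (T, D); pre_sep: (T, D). Returns fused (T, 2D) features."""
    # frame-wise dot-product similarity between mixture and pre-separated speech
    scores = np.sum(mixture * pre_sep, axis=-1, keepdims=True)  # (T, 1)
    weights = 1.0 / (1.0 + np.exp(-scores))                     # sigmoid gate
    attended = weights * pre_sep                                # re-weighted frames
    return np.concatenate([mixture, attended], axis=-1)

T, D = 4, 8
fused = attention_fusion(np.random.randn(T, D), np.random.randn(T, D))
```

Frames of the pre-separation output that agree with the mixture receive higher weight, so the fused representation emphasizes the pre-separated speech where it is most reliable.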
no code implementations • 19 Feb 2020 • Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jian-Hua Tao, Ye Bai
Recently, language identity information has been utilized to improve the performance of end-to-end code-switching (CS) speech recognition.
no code implementations • 5 Feb 2020 • Cunhang Fan, Bin Liu, Jian-Hua Tao, Jiangyan Yi, Zhengqi Wen
Specifically, we apply the deep clustering network to extract deep embedding features.
no code implementations • 6 Dec 2019 • Zhengkun Tian, Jiangyan Yi, Ye Bai, Jian-Hua Tao, Shuai Zhang, Zhengqi Wen
Once a fixed-length chunk of the input sequence is processed by the encoder, the decoder begins to predict symbols immediately.
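The chunk-wise streaming loop can be sketched as below, with toy stand-ins for the encoder and decoder; the chunk size and the `encode`/`decode` callables are illustrative assumptions.

```python
# Minimal sketch of chunk-wise streaming recognition: the encoder consumes
# fixed-length chunks of the input, and the decoder emits symbols as soon
# as each chunk is processed, rather than waiting for the full utterance.
CHUNK = 4  # assumed fixed chunk length

def stream_decode(features, encode, decode):
    outputs = []
    for start in range(0, len(features), CHUNK):
        chunk = features[start:start + CHUNK]
        state = encode(chunk)          # encoder processes one chunk
        outputs.extend(decode(state))  # decoder predicts immediately
    return outputs

# toy encoder/decoder for illustration
out = stream_decode(list(range(10)), encode=sum, decode=lambda s: [s])
```

Latency is thus bounded by the chunk length rather than the utterance length, which is the point of the chunk-wise design.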
no code implementations • 4 Dec 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengqi Wen, Zhengkun Tian, Shuai Zhang
To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from the external text-only data and leverage the whole context in a sentence.
no code implementations • 24 Oct 2019 • Zheng Lian, Jian-Hua Tao, Bin Liu, Jian Huang
Prior works on speech emotion recognition utilize various unsupervised learning approaches to deal with low-resource samples.
no code implementations • 24 Oct 2019 • Zheng Lian, Jian-Hua Tao, Bin Liu, Jian Huang
Different from emotion recognition in individual utterances, we propose a multimodal learning framework that uses relations and dependencies among utterances for conversational emotion analysis.
no code implementations • 24 Oct 2019 • Zheng Lian, Jian-Hua Tao, Bin Liu, Jian Huang
The secondary task is to learn a common representation in which speaker identities cannot be distinguished.
no code implementations • 23 Oct 2019 • Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang, Ming-Yue Niu
To sum up, the contributions of this paper lie in two areas: 1) We visualize the facial areas attended to in emotion recognition; 2) We analyze the contribution of different facial areas to different emotions in real-world conditions through experimental analysis.
no code implementations • 23 Oct 2019 • Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang
It outperforms the baseline system, which is optimized without the contrastive loss function, by 1.14% and 2.55% in weighted accuracy and unweighted accuracy, respectively.
no code implementations • 28 Sep 2019 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Zhengqi Wen
Furthermore, a path-aware regularization is proposed to assist SA-T to learn alignments and improve the performance.
no code implementations • 23 Jul 2019 • Cunhang Fan, Bin Liu, Jian-Hua Tao, Jiangyan Yi, Zhengqi Wen
Firstly, a DC network is trained to extract deep embedding features, which contain each source's information and are advantageous for discriminating each target speaker.
1 code implementation • 18 Jul 2019 • Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jian-Hua Tao
Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to a significant improvement in both robustness and overall naturalness, outperforming the baseline (a revised version of Tacotron2) with a MOS gap of 0.14 in a challenging test and achieving close-to-human quality (4.42 vs. 4.49 in MOS) on a general test.
no code implementations • 13 Jul 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial.
no code implementations • 17 Apr 2019 • Jia Li, Xiao Sun, Xing Wei, Changliang Li, Jian-Hua Tao
In recent years, the generation of conversation content based on deep neural networks has attracted considerable research attention.
no code implementations • 11 Nov 2018 • Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang
I have submitted a new version to arXiv:1910.13806.
1 code implementation • 13 Sep 2018 • Zheng Lian, Ya Li, Jian-Hua Tao, Jian Huang
We test our method in the EmotiW 2018 challenge and achieve promising results.
no code implementations • 20 Feb 2018 • Jiangyan Yi, Jian-Hua Tao, Zhengqi Wen, Bin Liu
The close-talking model is called the teacher model.
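A standard teacher-student setup of this kind can be sketched as follows: the close-talking (teacher) model supplies soft targets for the student. This is a generic distillation sketch under that assumption, not the paper's exact loss; the temperature value is illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's predictions against the
    teacher's softened output distribution."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

loss = distillation_loss([2.0, 1.0, 0.1], [0.5, 1.5, 0.2])
```

The loss is minimized when the student's distribution matches the teacher's, which is what lets the teacher's behavior transfer to the student.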
no code implementations • 28 Mar 2016 • Linlin Chao, Jian-Hua Tao, Minghao Yang, Ya Li, Zhengqi Wen
The other one is locating and re-weighting the perception attentions in the whole audio-visual stream for better recognition.