no code implementations • CAI (COLING) 2022 • Zhuo Gong, Daisuke Saito, Sheng Li, Hisashi Kawai, Nobuaki Minematsu
The experiments show that we can enhance an ASR E2E model based on encoder-decoder architecture by pre-training the decoder with text data.
Automatic Speech Recognition (ASR) +2
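A minimal sketch of the idea, assuming a Transformer decoder and illustrative sizes (nothing here is the paper's exact configuration): the decoder is first trained as a causal language model on text only, with a dummy memory standing in for the missing acoustics, and is later reused inside the encoder-decoder ASR model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 5000, 256  # illustrative sizes
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size)

def lm_pretrain_step(token_ids, optimizer):
    """One text-only step: the decoder attends to a zero memory,
    so it behaves as a causal language model."""
    x = embed(token_ids[:, :-1])                       # (B, L-1, D)
    L = x.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    memory = torch.zeros(x.size(0), 1, d_model)        # no acoustic context yet
    h = decoder(x, memory, tgt_mask=causal)
    loss = nn.functional.cross_entropy(
        out_proj(h).transpose(1, 2), token_ids[:, 1:])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# After pretraining, the same `decoder` is attached to the ASR model,
# where `memory` becomes the acoustic encoder output instead of zeros.
```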
no code implementations • 18 Dec 2023 • Peng Shen, Xugang Lu, Hisashi Kawai
Effective extraction and application of linguistic features are central to enhancing spoken language identification (LID) performance.
no code implementations • 18 Dec 2023 • Peng Shen, Xugang Lu, Hisashi Kawai
Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization to be addressed.
no code implementations • 20 Oct 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Our previous study discovered that completely aligning the distributions between the source and target domains can introduce negative transfer, where irrelevant classes from the source domain are mapped to classes in the target domain during distribution alignment.
no code implementations • 28 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task.
Automatic Speech Recognition (ASR) +3
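One way such a transfer is commonly realized is a distillation-style auxiliary loss; the sketch below is an illustration under that assumption (the pooling, projection, and loss weighting are hypothetical, not the paper's method).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(acoustic_frames, plm_token_states, proj):
    """acoustic_frames: (B, T, D_a) acoustic encoder outputs
       plm_token_states: (B, L, D_p) frozen PLM hidden states for the transcript
       proj: nn.Linear(D_a, D_p) bridging the modality gap."""
    a = proj(acoustic_frames).mean(dim=1)   # (B, D_p) pooled acoustics
    t = plm_token_states.mean(dim=1)        # (B, D_p) pooled text
    return 1.0 - F.cosine_similarity(a, t, dim=-1).mean()

# Hypothetical joint objective: the usual ASR loss plus a weighted transfer term,
# e.g. loss = asr_loss + 0.1 * distillation_loss(enc_out, plm_out, proj)
```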
no code implementations • 24 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required to transfer the context-dependent linguistic knowledge from the PLM to acoustic encoding.
Automatic Speech Recognition (ASR) +3
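A hedged sketch of one possible cross-modal alignment mechanism (module names and dimensions are illustrative assumptions, not the paper's design): PLM token states attend over acoustic frames, so each token acquires an acoustic counterpart that can be matched against its linguistic state.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Token-level alignment via cross-attention (illustrative)."""
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, plm_states, acoustic_states):
        # queries: text tokens (B, L, D); keys/values: acoustic frames (B, T, D)
        aligned, _ = self.attn(plm_states, acoustic_states, acoustic_states)
        # per-token MSE: how well the acoustics reconstruct the linguistic states
        return nn.functional.mse_loss(aligned, plm_states)
```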
no code implementations • 29 Jul 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, pronunciation-based modeling units can improve the sharing of modeling units during training compared to character-based units, but suffer from homophone problems.
Automatic Speech Recognition (ASR) +1
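The homophone problem is easy to see in a toy example: a single toned syllable maps to several characters, so a pronunciation-unit E2E model needs context to disambiguate. The tiny mapping below is illustrative only, not the paper's unit inventory.

```python
# Toy illustration: one Mandarin syllable (with tone) maps to many characters.
HOMOPHONES = {
    "shi4": ["是", "事", "市", "世", "视"],
    "yi1":  ["一", "衣", "医", "依"],
}

def syllables_to_candidates(syllables):
    """Expand a syllable sequence into all candidate character sequences;
    a real system would score these with a context/language model."""
    candidates = [[]]
    for s in syllables:
        chars = HOMOPHONES.get(s, [s])
        candidates = [c + [ch] for c in candidates for ch in chars]
    return ["".join(c) for c in candidates]

print(syllables_to_candidates(["yi1", "shi4"]))  # 20 candidate strings
```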
no code implementations • 8 Apr 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
The acoustic and linguistic features are important cues for the spoken language identification (LID) task.
no code implementations • 31 Mar 2022 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
To reduce the domain discrepancy and improve the performance of a cross-domain spoken language identification (SLID) system, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT) as an unsupervised domain adaptation (UDA) method.
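For intuition, here is a minimal Sinkhorn-style sketch of OT-based alignment between source- and target-domain feature batches; the paper's JDA couples features with label predictions in the transport cost, which this simplification omits.

```python
import torch

def sinkhorn_plan(source, target, reg=0.1, n_iter=50):
    """source: (n, d), target: (m, d); entropic-regularized OT plan."""
    cost = torch.cdist(source, target) ** 2     # pairwise squared L2 cost
    K = torch.exp(-cost / reg)                  # Gibbs kernel
    n, m = source.size(0), target.size(0)
    a = torch.full((n,), 1.0 / n)               # uniform source marginal
    b = torch.full((m,), 1.0 / m)               # uniform target marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iter):                     # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan

def ot_alignment_loss(source, target):
    plan = sinkhorn_plan(source, target)
    return (plan * torch.cdist(source, target) ** 2).sum()
```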
no code implementations • 7 Apr 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
However, in most of the discriminative training for the SiamNN, only the distribution of pairwise sample distances is considered, and the additional discriminative information in the joint distribution of samples is ignored.
no code implementations • 1 Mar 2021 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
Navigation guided by natural language instructions is particularly suitable for domestic service robots that interact naturally with users.
no code implementations • 12 Feb 2021 • Aly Magassouba, Komei Sugiura, Angelica Nakayama, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Hisashi Kawai
Thus, inferring the collision risk before a placing motion is crucial for achieving the requested task.
no code implementations • 9 Jan 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By initializing the two-branch neural network with the generatively learned model parameters of the JB model, we further train the model parameters on pairwise samples as a binary discrimination task.
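A schematic of that fine-tuning stage, assuming a generic two-branch scorer (the architecture and dimensions are placeholders; in the paper the branches are initialized from the generatively trained JB parameters rather than randomly):

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Two-branch (Siamese-style) scorer for same/different pairs."""
    def __init__(self, dim=256):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, 1)

    def forward(self, x1, x2):
        h1, h2 = self.branch(x1), self.branch(x2)
        return self.head(h1 * h2).squeeze(-1)   # same/different logit

def pairwise_step(model, x1, x2, same_label, optimizer):
    """Binary discrimination on a batch of embedding pairs."""
    loss = nn.functional.binary_cross_entropy_with_logits(
        model(x1, x2), same_label.float())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```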
no code implementations • 24 Dec 2020 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By minimizing the classification loss on the training set together with the adaptation loss on both the training and testing sets, the statistical distribution difference between the training and testing domains is reduced.
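As one concrete instance of such an adaptation loss, the sketch below uses maximum mean discrepancy (MMD), a common choice for matching training- and testing-domain features; the paper may use a different divergence.

```python
import torch

def mmd_loss(src, tgt, sigma=1.0):
    """src: (n, d), tgt: (m, d); biased RBF-kernel MMD^2 estimate."""
    def rbf(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return rbf(src, src).mean() + rbf(tgt, tgt).mean() - 2 * rbf(src, tgt).mean()

# Hypothetical joint objective of the form described above:
# total = classification_loss(src_logits, src_labels) + lam * mmd_loss(src_feat, tgt_feat)
```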
1 code implementation • 25 Jul 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture according to the auxiliary $F_{0}$ feature.
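The core of the QP idea can be sketched in a few lines: the dilation of each pitch-dependent dilated convolution tracks the pitch period derived from $F_{0}$, so the receptive field follows the auxiliary pitch contour. The sample rate and dense factor below are assumed values, not the paper's settings.

```python
import torch

def pitch_dependent_dilation(f0_hz, sample_rate=24000, dense_factor=4):
    """Per-frame dilation ~ one pitch period (in samples) / dense factor."""
    f0 = torch.clamp(f0_hz, min=1.0)  # guard against unvoiced (F0 = 0) frames
    return torch.clamp((sample_rate / (f0 * dense_factor)).round().long(), min=1)

print(pitch_dependent_dilation(torch.tensor([100.0, 200.0, 400.0])))
# tensor([60, 30, 15]) -> lower pitch => longer period => larger dilation
```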
no code implementations • 9 Jul 2020 • Tadashi Ogura, Aly Magassouba, Komei Sugiura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Hisashi Kawai
Domestic service robots (DSRs) are a promising solution to the shortage of home care workers.
1 code implementation • 18 May 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG.
Audio and Speech Processing • Sound
no code implementations • 27 Dec 2019 • Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai
However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model short-range dependencies) are smoothed out in the final representation.
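One common remedy, shown schematically below (an illustration, not necessarily the paper's design), is to aggregate statistics from both bottom and top layers so that short-range cues survive into the final representation.

```python
import torch
import torch.nn as nn

class MultiLevelEncoder(nn.Module):
    """Concatenates pooled bottom-layer and top-layer features."""
    def __init__(self, d_in=80, d_h=128):
        super().__init__()
        self.bottom = nn.Conv1d(d_in, d_h, kernel_size=3, padding=1)
        self.top = nn.Conv1d(d_h, d_h, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):                  # x: (B, d_in, T)
        low = torch.relu(self.bottom(x))   # short-range features
        high = torch.relu(self.top(low))   # longer-range features
        return torch.cat([self.pool(low), self.pool(high)], dim=1).squeeze(-1)
```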
no code implementations • 23 Dec 2019 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
To solve such a task, we propose the multimodal target-source classifier model with attention branches (MTCM-AB), which is an extension of the MTCM.
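A highly simplified sketch of the attention-branch pattern the MTCM-AB builds on (all names and sizes are illustrative, not the MTCM-AB specification): linguistic features produce attention weights over visual regions, and the classifier consumes the attended features.

```python
import torch
import torch.nn as nn

class AttentionBranchClassifier(nn.Module):
    def __init__(self, d_lang=256, d_vis=256, n_classes=2):
        super().__init__()
        self.attn = nn.Linear(d_lang + d_vis, 1)   # attention branch
        self.cls = nn.Linear(d_lang + d_vis, n_classes)

    def forward(self, lang, regions):   # lang: (B, Dl), regions: (B, R, Dv)
        q = lang.unsqueeze(1).expand(-1, regions.size(1), -1)
        w = torch.softmax(self.attn(torch.cat([q, regions], -1)), dim=1)
        attended = (w * regions).sum(dim=1)        # (B, Dv) attended visuals
        return self.cls(torch.cat([lang, attended], -1))
```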
no code implementations • 10 Sep 2019 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
In this paper, we address the automatic sentence generation of fetching instructions for domestic service robots.
no code implementations • 30 Apr 2019 • Chien-Feng Liao, Yu Tsao, Xugang Lu, Hisashi Kawai
In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a vector quantized variational autoencoder (VQ-VAE) algorithm.
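The quantization step itself is compact; a minimal sketch, assuming a learned codebook of illustrative size: each encoder frame is replaced by its nearest codebook entry, yielding a discrete symbol sequence.

```python
import torch

def quantize(frames, codebook):
    """frames: (T, D) encoder outputs; codebook: (K, D) learned codes."""
    dists = torch.cdist(frames, codebook)   # (T, K) frame-to-code distances
    ids = dists.argmin(dim=1)               # discrete symbol per frame
    return ids, codebook[ids]               # symbols and quantized vectors

codebook = torch.randn(128, 64)             # illustrative codebook
ids, quantized = quantize(torch.randn(100, 64), codebook)
print(ids[:10])  # the symbolic sequence used in place of raw acoustics
```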
no code implementations • 11 Jun 2018 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
This paper focuses on a multimodal language understanding method for carry-and-place tasks with domestic service robots.
no code implementations • 16 Jan 2018 • Komei Sugiura, Hisashi Kawai
The target task of this study is grounded language understanding for domestic service robots (DSRs).
no code implementations • 12 Sep 2017 • Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, Hisashi Kawai
For example, in measuring speech intelligibility, most evaluation metrics are based on the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between the estimated and clean speech is widely used to optimize the model.
Automatic Speech Recognition (ASR) +3
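The mismatch is easy to demonstrate: the training objective and the evaluation metric are computed by entirely different code paths. The sketch below assumes the pystoi package for the STOI measure and uses toy signals rather than real speech.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi (assumed available)

def frame_mse(estimated, clean, frame=512):
    """Frame-based MSE, the typical training objective."""
    n = len(clean) // frame * frame
    e = estimated[:n].reshape(-1, frame)
    c = clean[:n].reshape(-1, frame)
    return np.mean((e - c) ** 2)

fs = 16000
clean = np.random.randn(fs * 2)                 # toy stand-in for clean speech
estimated = clean + 0.1 * np.random.randn(fs * 2)
print("training objective (MSE):", frame_mse(estimated, clean))
print("evaluation metric (STOI):", stoi(clean, estimated, fs))
# Lowering the MSE does not necessarily raise STOI, hence the mismatch.
```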
no code implementations • 7 Mar 2017 • Szu-Wei Fu, Yu Tsao, Xugang Lu, Hisashi Kawai
Because the fully connected layers involved in deep neural networks (DNNs) and convolutional neural networks (CNNs) may not accurately characterize the local information of speech signals, particularly their high-frequency components, we employed fully convolutional layers to model the waveform.
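A minimal fully convolutional waveform-to-waveform sketch in that spirit (layer widths and kernel sizes are illustrative assumptions, not the paper's architecture): no fully connected layer appears, so local time-domain structure is preserved end to end.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(16, 16, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(16, 1, kernel_size=55, padding=27), nn.Tanh(),  # waveform out
)

noisy = torch.randn(1, 1, 16000)   # one second of 16 kHz audio
enhanced = fcn(noisy)              # same length, no flattening anywhere
print(enhanced.shape)              # torch.Size([1, 1, 16000])
```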