no code implementations • CAI (COLING) 2022 • Zhuo Gong, Daisuke Saito, Sheng Li, Hisashi Kawai, Nobuaki Minematsu
The experiments show that we can enhance an ASR E2E model based on encoder-decoder architecture by pre-training the decoder with text data.
Automatic Speech Recognition (ASR) +2
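A minimal sketch of the idea, assuming a Transformer decoder and illustrative sizes (nothing here is the paper's exact configuration): the decoder is first trained as a causal language model on text only, with a dummy memory standing in for the missing acoustics, and is later reused inside the encoder-decoder ASR model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 5000, 256  # illustrative sizes
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size)

def lm_pretrain_step(token_ids, optimizer):
    """One text-only step: the decoder attends to a zero memory,
    so it behaves as a causal language model."""
    x = embed(token_ids[:, :-1])                       # (B, L-1, D)
    L = x.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    memory = torch.zeros(x.size(0), 1, d_model)        # no acoustic context yet
    h = decoder(x, memory, tgt_mask=causal)
    loss = nn.functional.cross_entropy(
        out_proj(h).transpose(1, 2), token_ids[:, 1:])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# After pretraining, the same `decoder` is attached to the ASR model,
# where `memory` becomes the acoustic encoder output instead of zeros.
```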
no code implementations • 18 Dec 2023 • Peng Shen, Xugang Lu, Hisashi Kawai
Effective extraction and application of linguistic features are central to enhancing spoken language identification (LID) performance.
no code implementations • 18 Dec 2023 • Peng Shen, Xugang Lu, Hisashi Kawai
Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization to be addressed.
no code implementations • 20 Oct 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Our previous study discovered that completely aligning the distributions between the source and target domains can introduce negative transfer, where irrelevant classes from the source domain are mapped to classes in the target domain during distribution alignment.
no code implementations • 28 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task.
Automatic Speech Recognition (ASR) +3
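One way such a transfer is commonly realized is a distillation-style auxiliary loss; the sketch below is an illustration under that assumption (the pooling, projection, and loss weighting are hypothetical, not the paper's method).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(acoustic_frames, plm_token_states, proj):
    """acoustic_frames: (B, T, D_a) acoustic encoder outputs
       plm_token_states: (B, L, D_p) frozen PLM hidden states for the transcript
       proj: nn.Linear(D_a, D_p) bridging the modality gap."""
    a = proj(acoustic_frames).mean(dim=1)   # (B, D_p) pooled acoustics
    t = plm_token_states.mean(dim=1)        # (B, D_p) pooled text
    return 1.0 - F.cosine_similarity(a, t, dim=-1).mean()

# Hypothetical joint objective: the usual ASR loss plus a weighted transfer term,
# e.g. loss = asr_loss + 0.1 * distillation_loss(enc_out, plm_out, proj)
```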
no code implementations • 24 Sep 2023 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required to transfer the context-dependent linguistic knowledge from the PLM to acoustic encoding.
Automatic Speech Recognition (ASR) +3
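A hedged sketch of one possible cross-modal alignment mechanism (module names and dimensions are illustrative assumptions, not the paper's design): PLM token states attend over acoustic frames, so each token acquires an acoustic counterpart that can be matched against its linguistic state.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Token-level alignment via cross-attention (illustrative)."""
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, plm_states, acoustic_states):
        # queries: text tokens (B, L, D); keys/values: acoustic frames (B, T, D)
        aligned, _ = self.attn(plm_states, acoustic_states, acoustic_states)
        # per-token MSE: how well the acoustics reconstruct the linguistic states
        return nn.functional.mse_loss(aligned, plm_states)
```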
no code implementations • 29 Jul 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, pronunciation-based modeling units can improve the sharing of modeling units during training compared to character-based units, but suffer from homophone problems.
Automatic Speech Recognition (ASR) +1
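The homophone problem is easy to see in a toy example: a single toned syllable maps to several characters, so a pronunciation-unit E2E model needs context to disambiguate. The tiny mapping below is illustrative only, not the paper's unit inventory.

```python
# Toy illustration: one Mandarin syllable (with tone) maps to many characters.
HOMOPHONES = {
    "shi4": ["是", "事", "市", "世", "视"],
    "yi1":  ["一", "衣", "医", "依"],
}

def syllables_to_candidates(syllables):
    """Expand a syllable sequence into all candidate character sequences;
    a real system would score these with a context/language model."""
    candidates = [[]]
    for s in syllables:
        chars = HOMOPHONES.get(s, [s])
        candidates = [c + [ch] for c in candidates for ch in chars]
    return ["".join(c) for c in candidates]

print(syllables_to_candidates(["yi1", "shi4"]))  # 20 candidate strings
```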
no code implementations • 8 Apr 2022 • Peng Shen, Xugang Lu, Hisashi Kawai
The acoustic and linguistic features are important cues for the spoken language identification (LID) task.
no code implementations • 31 Mar 2022 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
To reduce the domain discrepancy and improve the performance of a cross-domain spoken language identification (SLID) system, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT) as an unsupervised domain adaptation (UDA) method.
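For intuition, here is a minimal Sinkhorn-style sketch of OT-based alignment between source- and target-domain feature batches; the paper's JDA couples features with label predictions in the transport cost, which this simplification omits.

```python
import torch

def sinkhorn_plan(source, target, reg=0.1, n_iter=50):
    """source: (n, d), target: (m, d); entropic-regularized OT plan."""
    cost = torch.cdist(source, target) ** 2     # pairwise squared L2 cost
    K = torch.exp(-cost / reg)                  # Gibbs kernel
    n, m = source.size(0), target.size(0)
    a = torch.full((n,), 1.0 / n)               # uniform source marginal
    b = torch.full((m,), 1.0 / m)               # uniform target marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iter):                     # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan

def ot_alignment_loss(source, target):
    plan = sinkhorn_plan(source, target)
    return (plan * torch.cdist(source, target) ** 2).sum()
```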
no code implementations • 7 Apr 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
However, in most of the discriminative training for the SiamNN, only the distribution of pairwise sample distances is considered, and the additional discriminative information in the joint distribution of samples is ignored.
no code implementations • 1 Mar 2021 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
Navigation guided by natural language instructions is particularly suitable for domestic service robots that interact naturally with users.
no code implementations • 12 Feb 2021 • Aly Magassouba, Komei Sugiura, Angelica Nakayama, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Hisashi Kawai
Thus, inferring the collision risk before a placing motion is crucial for achieving the requested task.
no code implementations • 9 Jan 2021 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By initializing the two-branch neural network with the generatively learned model parameters of the JB model, we further train the model parameters on pairwise samples as a binary discrimination task.
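A schematic of that fine-tuning stage, assuming a generic two-branch scorer (the architecture and dimensions are placeholders; in the paper the branches are initialized from the generatively trained JB parameters rather than randomly):

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Two-branch (Siamese-style) scorer for same/different pairs."""
    def __init__(self, dim=256):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, 1)

    def forward(self, x1, x2):
        h1, h2 = self.branch(x1), self.branch(x2)
        return self.head(h1 * h2).squeeze(-1)   # same/different logit

def pairwise_step(model, x1, x2, same_label, optimizer):
    """Binary discrimination on a batch of embedding pairs."""
    loss = nn.functional.binary_cross_entropy_with_logits(
        model(x1, x2), same_label.float())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```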
no code implementations • 24 Dec 2020 • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
By minimizing the classification loss on the training set together with the adaptation loss on both the training and testing sets, the statistical distribution difference between the training and testing domains is reduced.
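As one concrete instance of such an adaptation loss, the sketch below uses maximum mean discrepancy (MMD), a common choice for matching training- and testing-domain features; the paper may use a different divergence.

```python
import torch

def mmd_loss(src, tgt, sigma=1.0):
    """src: (n, d), tgt: (m, d); biased RBF-kernel MMD^2 estimate."""
    def rbf(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return rbf(src, src).mean() + rbf(tgt, tgt).mean() - 2 * rbf(src, tgt).mean()

# Hypothetical joint objective of the form described above:
# total = classification_loss(src_logits, src_labels) + lam * mmd_loss(src_feat, tgt_feat)
```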
1 code implementation • 25 Jul 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture according to the auxiliary $F_{0}$ feature.
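The core of the QP idea can be sketched in a few lines: the dilation of each pitch-dependent dilated convolution tracks the pitch period derived from $F_{0}$, so the receptive field follows the auxiliary pitch contour. The sample rate and dense factor below are assumed values, not the paper's settings.

```python
import torch

def pitch_dependent_dilation(f0_hz, sample_rate=24000, dense_factor=4):
    """Per-frame dilation ~ one pitch period (in samples) / dense factor."""
    f0 = torch.clamp(f0_hz, min=1.0)  # guard against unvoiced (F0 = 0) frames
    return torch.clamp((sample_rate / (f0 * dense_factor)).round().long(), min=1)

print(pitch_dependent_dilation(torch.tensor([100.0, 200.0, 400.0])))
# tensor([60, 30, 15]) -> lower pitch => longer period => larger dilation
```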
no code implementations • 9 Jul 2020 • Tadashi Ogura, Aly Magassouba, Komei Sugiura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Hisashi Kawai
Domestic service robots (DSRs) are a promising solution to the shortage of home care workers.
1 code implementation • 18 May 2020 • Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG.
Audio and Speech Processing • Sound
no code implementations • 27 Dec 2019 • Xugang Lu, Peng Shen, Sheng Li, Yu Tsao, Hisashi Kawai
However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model short-range dependencies) are smoothed out in the final representation.
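One common remedy, shown schematically below (an illustration, not necessarily the paper's design), is to aggregate statistics from both bottom and top layers so that short-range cues survive into the final representation.

```python
import torch
import torch.nn as nn

class MultiLevelEncoder(nn.Module):
    """Concatenates pooled bottom-layer and top-layer features."""
    def __init__(self, d_in=80, d_h=128):
        super().__init__()
        self.bottom = nn.Conv1d(d_in, d_h, kernel_size=3, padding=1)
        self.top = nn.Conv1d(d_h, d_h, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):                  # x: (B, d_in, T)
        low = torch.relu(self.bottom(x))   # short-range features
        high = torch.relu(self.top(low))   # longer-range features
        return torch.cat([self.pool(low), self.pool(high)], dim=1).squeeze(-1)
```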
no code implementations • 23 Dec 2019 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
To solve such a task, we propose the multimodal target-source classifier model with attention branches (MTCM-AB), which is an extension of the MTCM.
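A highly simplified sketch of the attention-branch pattern the MTCM-AB builds on (all names and sizes are illustrative, not the MTCM-AB specification): linguistic features produce attention weights over visual regions, and the classifier consumes the attended features.

```python
import torch
import torch.nn as nn

class AttentionBranchClassifier(nn.Module):
    def __init__(self, d_lang=256, d_vis=256, n_classes=2):
        super().__init__()
        self.attn = nn.Linear(d_lang + d_vis, 1)   # attention branch
        self.cls = nn.Linear(d_lang + d_vis, n_classes)

    def forward(self, lang, regions):   # lang: (B, Dl), regions: (B, R, Dv)
        q = lang.unsqueeze(1).expand(-1, regions.size(1), -1)
        w = torch.softmax(self.attn(torch.cat([q, regions], -1)), dim=1)
        attended = (w * regions).sum(dim=1)        # (B, Dv) attended visuals
        return self.cls(torch.cat([lang, attended], -1))
```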
no code implementations • 10 Sep 2019 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
In this paper, we address the automatic sentence generation of fetching instructions for domestic service robots.
no code implementations • 30 Apr 2019 • Chien-Feng Liao, Yu Tsao, Xugang Lu, Hisashi Kawai
In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a vector quantized variational autoencoder (VQ-VAE) algorithm.
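The quantization step itself is compact; a minimal sketch, assuming a learned codebook of illustrative size: each encoder frame is replaced by its nearest codebook entry, yielding a discrete symbol sequence.

```python
import torch

def quantize(frames, codebook):
    """frames: (T, D) encoder outputs; codebook: (K, D) learned codes."""
    dists = torch.cdist(frames, codebook)   # (T, K) frame-to-code distances
    ids = dists.argmin(dim=1)               # discrete symbol per frame
    return ids, codebook[ids]               # symbols and quantized vectors

codebook = torch.randn(128, 64)             # illustrative codebook
ids, quantized = quantize(torch.randn(100, 64), codebook)
print(ids[:10])  # the symbolic sequence used in place of raw acoustics
```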
no code implementations • 11 Jun 2018 • Aly Magassouba, Komei Sugiura, Hisashi Kawai
This paper focuses on a multimodal language understanding method for carry-and-place tasks with domestic service robots.
no code implementations • 16 Jan 2018 • Komei Sugiura, Hisashi Kawai
The target task of this study is grounded language understanding for domestic service robots (DSRs).
no code implementations • 12 Sep 2017 • Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, Hisashi Kawai
For example, in measuring speech intelligibility, most evaluation metrics are based on the short-time objective intelligibility (STOI) measure, while the frame-based minimum mean square error (MMSE) between the estimated and clean speech is widely used to optimize the model.
Automatic Speech Recognition (ASR) +3
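The mismatch is easy to demonstrate: the training objective and the evaluation metric are computed by entirely different code paths. The sketch below assumes the pystoi package for the STOI measure and uses toy signals rather than real speech.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi (assumed available)

def frame_mse(estimated, clean, frame=512):
    """Frame-based MSE, the typical training objective."""
    n = len(clean) // frame * frame
    e = estimated[:n].reshape(-1, frame)
    c = clean[:n].reshape(-1, frame)
    return np.mean((e - c) ** 2)

fs = 16000
clean = np.random.randn(fs * 2)                 # toy stand-in for clean speech
estimated = clean + 0.1 * np.random.randn(fs * 2)
print("training objective (MSE):", frame_mse(estimated, clean))
print("evaluation metric (STOI):", stoi(clean, estimated, fs))
# Lowering the MSE does not necessarily raise STOI, hence the mismatch.
```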
no code implementations • 7 Mar 2017 • Szu-Wei Fu, Yu Tsao, Xugang Lu, Hisashi Kawai
Because the fully connected layers involved in deep neural networks (DNNs) and convolutional neural networks (CNNs) may not accurately characterize the local information of speech signals, particularly their high-frequency components, we employed fully convolutional layers to model the waveform.
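A minimal fully convolutional waveform-to-waveform sketch in that spirit (layer widths and kernel sizes are illustrative assumptions, not the paper's architecture): no fully connected layer appears, so local time-domain structure is preserved end to end.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(16, 16, kernel_size=55, padding=27), nn.LeakyReLU(),
    nn.Conv1d(16, 1, kernel_size=55, padding=27), nn.Tanh(),  # waveform out
)

noisy = torch.randn(1, 1, 16000)   # one second of 16 kHz audio
enhanced = fcn(noisy)              # same length, no flattening anywhere
print(enhanced.shape)              # torch.Size([1, 1, 16000])
```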