Adaptive Speech Duration Modification using a Deep-Generative Framework

29 Sep 2021 · Ravi Shankar, Archana Venkataraman

We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to generate this attention map via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal using the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we generate the attention map as a proxy for the similarity matrix between the given input speech and an unknown target speech signal. Using this similarity matrix, we compute a warping path of alignment between the two signals. Our experiments demonstrate that this adaptive framework produces similar results to dynamic time warping, which relies on a known target signal, on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, thus highlighting the practical utility of our method.
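The abstract's final step, computing a warping path through a similarity matrix, is the classic dynamic-programming alignment used in dynamic time warping. As a minimal sketch (not the authors' implementation), the function below takes a frame-by-frame similarity matrix, such as the predicted attention map, negates it into a cost matrix, and backtracks the minimum-cost monotonic path from the first frame pair to the last:

```python
import numpy as np

def warping_path(similarity):
    """Find a monotonic alignment path through a similarity matrix.

    Sketch of standard DTW with cost = -similarity, so higher
    similarity yields lower alignment cost. `similarity` has shape
    (input_frames, target_frames); the returned path is a list of
    (input_frame, target_frame) index pairs.
    """
    n, m = similarity.shape
    cost = -similarity
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    # Forward pass: accumulate minimal cost with steps (1,0), (0,1), (1,1).
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + best
    # Backtrack from the last frame pair to (0, 0).
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates)
        path.append((i, j))
    return path[::-1]
```

For an identity-like similarity matrix the recovered path is the diagonal, i.e. a one-to-one frame correspondence; off-diagonal attention mass instead stretches or compresses the input frames, which is what realizes the duration modification.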
