Introduced by Kalchbrenner et al. in Efficient Neural Audio Synthesis

WaveRNN is a single-layer recurrent neural network for audio generation that is designed to efficiently predict 16-bit raw audio samples.

The overall computation in the WaveRNN is as follows (biases omitted for brevity):

$$ \mathbf{x}_{t} = \left[\mathbf{c}_{t-1},\mathbf{f}_{t-1}, \mathbf{c}_{t}\right] $$

$$ \mathbf{u}_{t} = \sigma\left(\mathbf{R}_{u}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{u}\mathbf{x}_{t}\right) $$

$$ \mathbf{r}_{t} = \sigma\left(\mathbf{R}_{r}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{r}\mathbf{x}_{t}\right) $$

$$ \mathbf{e}_{t} = \tau\left(\mathbf{r}_{t} \odot \left(\mathbf{R}_{e}\mathbf{h}_{t-1}\right) + \mathbf{I}^{*}_{e}\mathbf{x}_{t} \right) $$

$$ \mathbf{h}_{t} = \mathbf{u}_{t} \odot \mathbf{h}_{t-1} + \left(1-\mathbf{u}_{t}\right) \odot \mathbf{e}_{t} $$

$$ \mathbf{y}_{c}, \mathbf{y}_{f} = \text{split}\left(\mathbf{h}_{t}\right) $$

$$ P\left(\mathbf{c}_{t}\right) = \text{softmax}\left(\mathbf{O}_{2}\text{relu}\left(\mathbf{O}_{1}\mathbf{y}_{c}\right)\right) $$

$$ P\left(\mathbf{f}_{t}\right) = \text{softmax}\left(\mathbf{O}_{4}\text{relu}\left(\mathbf{O}_{3}\mathbf{y}_{f}\right)\right) $$

where the $*$ indicates a masked matrix whereby the last coarse input $\mathbf{c}_{t}$ is only connected to the fine part of the states $\mathbf{u}_{t}$, $\mathbf{r}_{t}$, $\mathbf{e}_{t}$ and $\mathbf{h}_{t}$ and thus only affects the fine output $\mathbf{y}_{f}$. The coarse and fine parts $\mathbf{c}_{t}$ and $\mathbf{f}_{t}$ are encoded as scalars in $\left[0, 255\right]$ and scaled to the interval $\left[-1, 1\right]$. The matrix $\mathbf{R}$, formed from the matrices $\mathbf{R}_{u}$, $\mathbf{R}_{r}$ and $\mathbf{R}_{e}$, is computed as a single matrix-vector product to produce the contributions to all three gates $\mathbf{u}_{t}$, $\mathbf{r}_{t}$ and $\mathbf{e}_{t}$ (a variant of the GRU cell). $\sigma$ and $\tau$ are the standard sigmoid and tanh non-linearities.
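The equations above can be sketched as a single teacher-forced step in NumPy. This is a minimal illustration, not the authors' implementation: the names `init_params` and `wavernn_step`, the tiny hidden size, and the random weight initialisation are all assumptions made for the example, and biases are omitted as in the equations. The first half of each gate is treated as the coarse part and the second half as the fine part.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(hidden=16, n_classes=256, seed=0):
    """Random illustrative weights (hypothetical sizes, not the paper's)."""
    rng = np.random.default_rng(seed)
    half = hidden // 2
    # Mask for I*: the column that multiplies c_t (index 2) is zeroed
    # on the coarse half of each of the three stacked gates.
    mask = np.ones((3 * hidden, 3))
    for g in range(3):
        mask[g * hidden : g * hidden + half, 2] = 0.0
    w = lambda *shape: rng.normal(scale=0.3, size=shape)
    return {
        "R": w(3 * hidden, hidden),      # stacked R_u, R_r, R_e
        "I": w(3 * hidden, 3) * mask,    # stacked masked I*_u, I*_r, I*_e
        "O1": w(half, half), "O2": w(n_classes, half),
        "O3": w(half, half), "O4": w(n_classes, half),
    }

def wavernn_step(params, h_prev, c_prev, f_prev, c_curr):
    """One step: returns h_t, P(c_t), P(f_t). Inputs are scalars in [-1, 1]."""
    half = h_prev.shape[0] // 2
    x = np.array([c_prev, f_prev, c_curr])
    Rh = params["R"] @ h_prev            # one matrix-vector product for all gates
    Ix = params["I"] @ x
    Rh_u, Rh_r, Rh_e = np.split(Rh, 3)
    Ix_u, Ix_r, Ix_e = np.split(Ix, 3)
    u = sigmoid(Rh_u + Ix_u)
    r = sigmoid(Rh_r + Ix_r)
    e = np.tanh(r * Rh_e + Ix_e)
    h = u * h_prev + (1.0 - u) * e
    y_c, y_f = h[:half], h[half:]        # split(h_t)
    p_c = softmax(params["O2"] @ np.maximum(params["O1"] @ y_c, 0.0))
    p_f = softmax(params["O4"] @ np.maximum(params["O3"] @ y_f, 0.0))
    return h, p_c, p_f

params = init_params()
h0 = np.random.default_rng(1).normal(size=16)
h1, p_c, p_f = wavernn_step(params, h0, c_prev=0.1, f_prev=-0.3, c_curr=0.5)
```

Because of the mask, changing `c_curr` leaves the coarse half of `h1` and `p_c` unchanged and only alters the fine half, which is what allows the coarse part to be predicted before $\mathbf{c}_{t}$ is known.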

Each part of the split state feeds into a softmax layer over the corresponding 8 bits, and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces ($2^{8}$ values each) instead of a single large output space (with $2^{16}$ values).
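To make the coarse/fine decomposition concrete, the sketch below splits 16-bit sample values into their 8 most significant (coarse) and 8 least significant (fine) bits and recombines them. The helper names `split_sample`, `scale_to_unit` and `combine_sample` are hypothetical, the samples are assumed to be unsigned integers in $[0, 65535]$ (signed PCM would first need an offset of 32768), and the exact scaling formula to $[-1, 1]$ is an assumption, since the text only states the target interval.

```python
import numpy as np

def split_sample(s):
    """Split unsigned 16-bit values into coarse (high 8 bits) and fine (low 8 bits)."""
    s = np.asarray(s, dtype=np.int64)
    return s // 256, s % 256

def scale_to_unit(v):
    """Map integers in [0, 255] linearly onto [-1, 1] (assumed formula)."""
    return np.asarray(v, dtype=np.float64) / 127.5 - 1.0

def combine_sample(coarse, fine):
    """Rebuild the 16-bit value from its coarse and fine parts."""
    return np.asarray(coarse, dtype=np.int64) * 256 + np.asarray(fine, dtype=np.int64)

samples = np.array([0, 255, 256, 40000, 65535])
coarse, fine = split_sample(samples)
restored = combine_sample(coarse, fine)

# Output-space sizes: two 8-bit softmaxes versus one 16-bit softmax.
dual_softmax_outputs = 2 * 2**8
single_softmax_outputs = 2**16
```

With this split, the two softmax layers together have 512 output units, compared with 65536 for a single softmax over all 16-bit values.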

Source: Efficient Neural Audio Synthesis

Latest Papers

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge
Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi
Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis, Yannis Stylianou
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
Dipjyoti Paul, Yannis Pantazis, Yannis Stylianou
Audiovisual Speech Synthesis using Tacotron2
Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar
End-To-End Speech Synthesis Applied to Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, João Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
Towards Robust Neural Vocoding for Speech Generation: A Survey
Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-yi Lee
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis
Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang
DurIAN: Duration Informed Attention Network For Multimodal Synthesis
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
Jean-Marc Valin, Jan Skoglund
Efficient Neural Audio Synthesis
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu


Speech Synthesis: 10 papers (66.67%)
Text-To-Speech Synthesis: 3 papers (20.00%)
Denoising: 1 paper (6.67%)
Voice Conversion: 1 paper (6.67%)