Discriminators

Multiple Random Window Discriminator

Introduced by Bińkowski et al. in High Fidelity Speech Synthesis with Adversarial Networks

The Multiple Random Window Discriminator (MRWD) is the discriminator ensemble used in the GAN-TTS text-to-speech architecture. Its members, Random Window Discriminators (RWDs), operate on randomly sub-sampled fragments of the real or generated samples. The ensemble evaluates audio in different, complementary ways and is obtained by taking the Cartesian product of two parameter spaces: (i) the size of the random windows fed into the discriminator; (ii) whether a discriminator is conditioned on linguistic and pitch features. For example, the authors' best-performing model uses five window sizes (240, 480, 960, 1920, 3600 samples), each in a conditional and an unconditional variant, which yields 10 discriminators in total.
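
The Cartesian product described above can be sketched in a few lines of Python. This is only an illustration of how the ensemble is enumerated, not the authors' code; the function name `enumerate_discriminators` and the dictionary keys are hypothetical.

```python
from itertools import product

# Window sizes (in samples) quoted in the text for the best-performing model.
WINDOW_SIZES = (240, 480, 960, 1920, 3600)
# Each window size is used in an unconditional and a conditional variant.
CONDITIONING = (False, True)


def enumerate_discriminators(window_sizes=WINDOW_SIZES, conditioning=CONDITIONING):
    """Return one config per discriminator in the ensemble (hypothetical helper)."""
    return [
        {"window_size": w, "conditional": c}
        for w, c in product(window_sizes, conditioning)
    ]


configs = enumerate_discriminators()
print(len(configs))  # 10
```

Five window sizes times two conditioning settings gives the 10 discriminators mentioned in the text.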

Using random windows of different sizes, rather than the full generated sample, has a data augmentation effect and also reduces the computational complexity of the RWDs. In the first layer of each discriminator, the input raw waveform is reshaped (downsampled) to a constant temporal dimension $\omega = 240$ by moving consecutive blocks of samples into the channel dimension, i.e. from $\left[\omega k, 1\right]$ to $\left[\omega, k\right]$, where $k$ is the downsampling factor (e.g. $k = 8$ for input window size $1920$). This way, all the RWDs have the same architecture and similar computational complexity despite different window sizes.
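
As a rough illustration of this reshape, the sketch below folds a window of $\omega k$ samples into $\omega$ time steps of $k$ channels using plain Python lists. The helper name `fold_into_channels` is hypothetical, and a real implementation would operate on batched tensors rather than lists.

```python
OMEGA = 240  # constant temporal dimension after the reshape (from the text)


def fold_into_channels(window, omega=OMEGA):
    """Reshape a [omega * k, 1] window into [omega, k] (hypothetical helper).

    Each output time step holds k consecutive input samples as its channels.
    """
    if len(window) % omega != 0:
        raise ValueError("window length must be a multiple of omega")
    k = len(window) // omega  # downsampling factor
    return [window[t * k:(t + 1) * k] for t in range(omega)]


window = list(range(1920))           # stand-in for 1920 raw audio samples
folded = fold_into_channels(window)
print(len(folded), len(folded[0]))   # 240 8
```

Under the same rule, the other window sizes in the text give $k = 1, 2, 4$ and $15$ for 240, 480, 960 and 3600 samples respectively.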

The conditional discriminators have access to linguistic and pitch features, and can measure whether the generated audio matches the input conditioning. Random windows in conditional discriminators therefore need to be aligned with the conditioning frequency, so that the correspondence between the waveform and the linguistic features within the sampled window is preserved. This restricts valid window positions to the frame boundaries of the conditioning signal (200Hz, or every 5ms). The unconditional discriminators, by contrast, only evaluate whether the generated audio sounds realistic, regardless of the conditioning. The random windows for these discriminators are sampled without constraints at the full 24kHz frequency, which further increases the amount of training data.
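
The two sampling regimes can be sketched as follows. At 24kHz audio and 200Hz conditioning, one conditioning frame spans 24000 / 200 = 120 samples, so conditional windows may only start at multiples of 120. The function `sample_window_start` and its parameters are assumptions made for illustration; the source does not specify how the sampling is implemented.

```python
import random

SAMPLE_RATE_HZ = 24_000       # audio sampling rate (from the text)
CONDITIONING_RATE_HZ = 200    # conditioning frame rate (from the text)
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // CONDITIONING_RATE_HZ  # 120 samples = 5 ms


def sample_window_start(num_samples, window_size, conditional, rng=random):
    """Pick a random window start index (hypothetical helper).

    Conditional windows start on a conditioning-frame boundary;
    unconditional windows may start at any sample.
    """
    max_start = num_samples - window_size
    if max_start < 0:
        raise ValueError("window is longer than the audio")
    if conditional:
        # Only multiples of SAMPLES_PER_FRAME are valid start positions.
        return rng.randrange(0, max_start + 1, SAMPLES_PER_FRAME)
    return rng.randrange(0, max_start + 1)


rng = random.Random(0)
start_cond = sample_window_start(48_000, 1920, conditional=True, rng=rng)
start_uncond = sample_window_start(48_000, 1920, conditional=False, rng=rng)
print(start_cond % SAMPLES_PER_FRAME)  # 0
```

For a two-second clip and a 1920-sample window, the unconditional case has 46081 possible start positions while the conditional case has 385, which is one way to see why unconstrained sampling increases the effective amount of training data.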

Architecturally, the discriminators consist of blocks (DBlocks) that are similar to the GBlocks used in the generator, but without batch normalisation. Unconditional RWDs are composed entirely of DBlocks. In conditional RWDs, the input waveform is gradually downsampled by DBlocks until the temporal dimension of the activation equals that of the conditioning, at which point a conditional DBlock is used. This joint information is then passed to the remaining DBlocks, whose final output is average-pooled to obtain a scalar. The dilation factors in the DBlocks' convolutions follow the pattern 1, 2, 1, 2. Unlike the generator, the discriminator operates on a relatively small window, and the authors did not observe any benefit from using larger dilation factors.

Source: High Fidelity Speech Synthesis with Adversarial Networks

Tasks


Task               Papers   Share
Speech Synthesis   2        100.00%
