Multiple Random Window Discriminator Explained

**Multiple Random Window Discriminator** (MRWD) is the discriminator ensemble used in the [GAN-TTS](https://paperswithcode.com/method/gan-tts) text-to-speech architecture. It consists of Random Window Discriminators (RWDs) that operate on randomly sub-sampled fragments of the real or generated samples. The ensemble evaluates audio in different, complementary ways and is obtained by taking the Cartesian product of two parameter spaces: (i) the size of the random windows fed into the discriminator; (ii) whether a discriminator is conditioned on linguistic and pitch features. For example, in the authors' best-performing model, they consider five window sizes (240, 480, 960, 1920, 3600 samples), which yields 10 discriminators in total.
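The Cartesian product described above can be enumerated directly. The following is a minimal sketch, not the authors' code; the names `WINDOW_SIZES`, `CONDITIONING`, and `build_ensemble` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Window sizes (in samples) quoted above for the best-performing model.
WINDOW_SIZES = (240, 480, 960, 1920, 3600)
# Each window size is used once without and once with conditioning.
CONDITIONING = (False, True)


def build_ensemble(window_sizes=WINDOW_SIZES, conditioning=CONDITIONING):
    """Return one (window_size, is_conditional) config per discriminator."""
    return list(product(window_sizes, conditioning))


ensemble = build_ensemble()
print(len(ensemble))  # 10
```

Each tuple in `ensemble` would configure one RWD; five window sizes times two conditioning settings gives the 10 discriminators mentioned above.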

Using random windows of different sizes, rather than the full generated sample, has a data augmentation effect and also reduces the computational complexity of the RWDs. In the first layer of each discriminator, the input raw waveform is reshaped (downsampled) to a constant temporal dimension $\omega = 240$ by moving consecutive blocks of samples into the channel dimension, i.e. from $\left[\omega k, 1\right]$ to $\left[\omega, k\right]$, where $k$ is the downsampling factor (e.g. $k = 8$ for input window size $1920$). This way, all the RWDs have the same architecture and similar computational complexity despite different window sizes.
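The reshape from $\left[\omega k, 1\right]$ to $\left[\omega, k\right]$ can be illustrated in plain Python. This is a sketch under the assumption that each block of $k$ consecutive samples becomes the $k$ channels of one time step; `fold_into_channels` is a hypothetical helper name, and nested lists stand in for tensors.

```python
OMEGA = 240  # constant temporal dimension after the reshape


def fold_into_channels(window, omega=OMEGA):
    """Reshape a window of omega*k samples from [omega*k, 1] to [omega, k]."""
    if len(window) % omega != 0:
        raise ValueError("window length must be a multiple of omega")
    k = len(window) // omega  # downsampling factor
    # Row t holds the k consecutive samples window[t*k : (t+1)*k].
    return [list(window[t * k:(t + 1) * k]) for t in range(omega)]


window = list(range(1920))  # stand-in for 1920 raw audio samples
folded = fold_into_channels(window)
print(len(folded), len(folded[0]))  # 240 8
```

With the window sizes listed earlier, the same function yields $k = 1, 2, 4, 8, 15$, so every discriminator sees an input with 240 time steps.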

The conditional discriminators have access to linguistic and pitch features, and can measure whether the generated audio matches the input conditioning. This means that random windows in conditional discriminators need to be aligned with the conditioning frequency to preserve the correspondence between the waveform and linguistic features within the sampled window. This restricts valid window positions to the frequency of the conditioning signal (200 Hz, i.e. every 5 ms). The unconditional discriminators, in contrast, only evaluate whether the generated audio sounds realistic, regardless of the conditioning. The random windows for these discriminators are sampled without constraints at the full 24 kHz sampling frequency, which further increases the amount of training data.
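The difference between the two sampling schemes can be sketched as follows. This is an illustration only, assuming that "aligned" means the window start falls on a conditioning-frame boundary; `sample_window_start` and the constant names are hypothetical. At 24 kHz audio and 200 Hz conditioning, one conditioning frame spans 120 samples.

```python
import random

SAMPLE_RATE = 24_000      # Hz, audio sampling frequency
CONDITIONING_RATE = 200   # Hz, one conditioning frame every 5 ms
SAMPLES_PER_FRAME = SAMPLE_RATE // CONDITIONING_RATE  # 120 samples


def sample_window_start(num_samples, window_size, conditional, rng=random):
    """Pick a start index for a random window of `window_size` samples."""
    last_start = num_samples - window_size
    if last_start < 0:
        raise ValueError("window is longer than the audio")
    if conditional:
        # Only starts on a conditioning-frame boundary are allowed.
        last_frame = last_start // SAMPLES_PER_FRAME
        return SAMPLES_PER_FRAME * rng.randint(0, last_frame)
    # Unconditional windows may start at any sample.
    return rng.randint(0, last_start)
```

For a 2-second clip (48,000 samples) and a 3600-sample window, this gives 371 valid start positions in the conditional case and 44,401 in the unconditional case. All five window sizes listed earlier are multiples of 120, so an aligned window always covers a whole number of conditioning frames.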

For the architecture, the discriminators consist of blocks (DBlocks) that are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without batch normalisation. Unconditional RWDs are composed entirely of DBlocks. In conditional RWDs, the input waveform is gradually downsampled by DBlocks until the temporal dimension of the activation is equal to that of the conditioning, at which point a conditional [DBlock](https://paperswithcode.com/method/dblock) is used. This joint information is then passed to the remaining DBlocks, whose final output is average-pooled to obtain a scalar. The dilation factors in the DBlocks' convolutions follow the pattern 1, 2, 1, 2. Unlike the generator, the discriminator operates on a relatively small window, and the authors did not observe any benefit from using larger dilation factors.

| Component | Type |
|---|---|
| Average Pooling | Pooling Operations |
| Conditional DBlock | Audio Model Blocks |
| DBlock | Audio Model Blocks |
