Generative Audio Models


Introduced by Donahue et al. in Adversarial Audio Synthesis

SpecGAN is a generative adversarial network (GAN) method for spectrogram-based, frequency-domain audio generation. It relies on a spectrogram representation that is well suited to GANs designed for image generation and that can be approximately inverted back to audio.

To process audio into suitable spectrograms, the authors perform the short-time Fourier transform with 16 ms windows and an 8 ms stride, resulting in 128 frequency bins linearly spaced from 0 to 8 kHz. They take the magnitude of the resulting spectra and scale the amplitude values logarithmically to better align with human perception. They then normalize each frequency bin to have zero mean and unit variance, clip the spectra to $3$ standard deviations, and rescale to $\left[-1, 1\right]$.
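The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. Several details are assumptions not stated in the text: a 16 kHz sample rate (implied by the 0 to 8 kHz range), a Hann window, dropping the Nyquist bin to get 128 bins from a 256-point transform, the small `eps` added before the logarithm, and all function names.

```python
import numpy as np

SAMPLE_RATE = 16_000                 # assumed: implied by the 0 to 8 kHz range
WIN = SAMPLE_RATE * 16 // 1000       # 16 ms window -> 256 samples
HOP = SAMPLE_RATE * 8 // 1000        # 8 ms stride  -> 128 samples
N_BINS = 128


def log_magnitude_spectrogram(audio, eps=1e-6):
    """Return a (frames, 128) log-magnitude spectrogram of a 1-D signal."""
    n_frames = 1 + (len(audio) - WIN) // HOP
    window = np.hanning(WIN)         # window shape is an assumption
    frames = np.stack(
        [audio[i * HOP : i * HOP + WIN] * window for i in range(n_frames)]
    )
    spectra = np.fft.rfft(frames, axis=1)        # (frames, 129) complex values
    magnitude = np.abs(spectra)[:, :N_BINS]      # drop Nyquist bin (assumption)
    return np.log(magnitude + eps)               # eps avoids log(0)


def fit_bin_statistics(log_specs):
    """Per-frequency-bin mean and standard deviation over all frames."""
    stacked = np.concatenate(log_specs, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)


def normalize(log_spec, mean, std, clip=3.0):
    """Standardize each bin, clip to +/- `clip` std, rescale to [-1, 1]."""
    z = (log_spec - mean) / std
    return np.clip(z, -clip, clip) / clip


# Stand-in data: one second of white noise instead of real audio.
rng = np.random.default_rng(0)
audio = rng.standard_normal(SAMPLE_RATE)
log_spec = log_magnitude_spectrogram(audio)
mean, std = fit_bin_statistics([log_spec])
x = normalize(log_spec, mean, std)
print(x.shape, float(x.min()), float(x.max()))
```

In practice the per-bin statistics would be computed once over the whole training set and reused for every example, rather than fitted on a single clip as in this toy usage.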

They then apply the DCGAN (deep convolutional GAN) approach to the resulting spectra, treating each spectrogram as an image.
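Spectra produced by the generator live in the normalized $\left[-1, 1\right]$ range, so turning them back into magnitudes means undoing the rescaling, standardization, and logarithm described above. The sketch below shows one way this could look. The function name and the stand-in statistics are hypothetical, and the inversion is only approximate: values beyond 3 standard deviations were clipped and the phase was discarded when the magnitude was taken, so recovering a waveform additionally requires a phase-estimation step that is not covered here.

```python
import numpy as np


def denormalize(x, mean, std, clip=3.0):
    """Map normalized values in [-1, 1] back to linear magnitudes.

    `mean` and `std` are the per-frequency-bin statistics used during
    preprocessing (hypothetical names for this sketch).
    """
    z = np.clip(x, -1.0, 1.0) * clip     # undo the rescale to [-1, 1]
    log_mag = z * std + mean             # undo per-bin standardization
    return np.exp(log_mag)               # undo the logarithmic scaling


# Toy usage with stand-in statistics (zero mean, unit std per bin).
mean = np.zeros(128)
std = np.ones(128)
generated = np.zeros((4, 128))           # stand-in for generator output
magnitude = denormalize(generated, mean, std)
print(magnitude.shape)
```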

Source: Adversarial Audio Synthesis



Task Papers Share
Audio Generation 1 50.00%
Image Generation 1 50.00%
