Audio Generation
64 papers with code • 3 benchmarks • 9 datasets
Audio generation (synthesis) is the task of generating raw audio, such as speech.
(Image credit: MelNet)
Latest papers with no code
Bass Accompaniment Generation via Latent Diffusion
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations, and a conditional latent diffusion model that takes as input the latent encoding of a mix and generates the latent encoding of a corresponding stem.
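The pipeline described above (compress audio into latents, then run a conditional diffusion model that maps a mix latent to a stem latent) can be sketched in a toy form. Everything here is a stand-in for illustration: the random orthogonal projection plays the role of the learned autoencoder, and `denoiser` plays the role of the trained conditional network; shapes, schedules, and step counts are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "autoencoder": a random orthogonal projection stands in for the
# learned, invertible compression of waveform frames into latents.
frame_len, latent_dim = 64, 16
W = np.linalg.qr(rng.normal(size=(frame_len, frame_len)))[0][:, :latent_dim]

def encode(frames):   # (n, frame_len) -> (n, latent_dim)
    return frames @ W

def decode(latents):  # (n, latent_dim) -> (n, frame_len)
    return latents @ W.T

def denoiser(z_t, z_mix, t):
    # Placeholder for the trained conditional network that predicts the
    # clean stem latent from a noisy one, conditioned on the mix latent.
    return 0.9 * z_t + 0.1 * z_mix

def reverse_step(z_t, z_mix, t, beta=0.02):
    z0_hat = denoiser(z_t, z_mix, t)
    noise = rng.normal(size=z_t.shape) if t > 1 else 0.0
    return z0_hat + np.sqrt(beta) * noise

mix = rng.normal(size=(4, frame_len))  # four frames of a "mix"
z_mix = encode(mix)
z = rng.normal(size=z_mix.shape)       # stem latent starts as pure noise
for t in range(10, 0, -1):             # iterative conditional denoising
    z = reverse_step(z, z_mix, t)
stem = decode(z)                       # invert latents back to frames
print(stem.shape)
```

The key structural idea is that diffusion runs entirely in the compact latent space, and only the final latent is decoded back to a waveform-length representation.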
EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks
The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns.
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation.
Masked Audio Generation using a Single Non-Autoregressive Transformer
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens.
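Masked generative decoding of this kind typically starts from a fully masked token sequence and fills in positions over a few parallel steps, committing the model's most confident predictions first. Below is a minimal sketch of that loop under stated assumptions: the `model_confidences` function is a hypothetical stand-in (returning random predictions) for the non-autoregressive transformer, and the schedule that unmasks an equal fraction per step is one simple choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, vocab, seq_len = -1, 32, 16

def model_confidences(tokens):
    # Stand-in for the transformer: for every position it returns a
    # predicted token id and a confidence score.
    preds = rng.integers(0, vocab, size=tokens.shape)
    conf = rng.random(tokens.shape)
    return preds, conf

def masked_decode(steps=4):
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        preds, conf = model_confidences(tokens)
        conf[tokens != MASK] = -np.inf  # never revisit committed tokens
        # Unmask the most confident fraction of remaining positions.
        n_fill = int(np.ceil((tokens == MASK).sum() / (steps - step)))
        fill = np.argsort(conf)[-n_fill:]
        tokens[fill] = preds[fill]
    return tokens

out = masked_decode()
print((out != MASK).all())  # every position decoded after `steps` passes
```

Because each pass predicts all masked positions at once, the number of model calls is the (small, fixed) step count rather than the sequence length, which is the source of the speedup over autoregressive decoding.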
Efficient Parallel Audio Generation using Group Masked Language Modeling
We present a fast and high-quality codec language model for parallel audio generation.
Audiobox: Unified Audio Generation with Natural Language Prompts
Research communities have made great progress over the past year in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data.
Diffusion-EXR: Controllable Review Generation for Explainable Recommendation via Diffusion Models
Denoising Diffusion Probabilistic Model (DDPM) has shown great competence in image and audio generation tasks.
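For context, the DDPM forward (noising) process mentioned here has a well-known closed form: q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), where ᾱ_t is the cumulative product of (1 − β_t). A minimal sketch, using the standard linear β schedule and a toy sinusoid as the "audio" signal (the schedule endpoints and signal are illustrative choices):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # standard linear noise schedule
alphas_bar = np.cumprod(1.0 - betas) # cumulative signal-retention factor

def q_sample(x0, t, rng):
    # Sample x_t ~ q(x_t | x_0) in closed form, without iterating steps.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 8 * np.pi, 256))  # toy "audio" signal
x_mid = q_sample(x0, T // 2, rng)            # partially noised
x_end = q_sample(x0, T - 1, rng)             # nearly pure Gaussian noise
print(x_end.shape)
```

Training a DDPM amounts to regressing the injected noise `eps` from `x_t` and `t`; generation then inverts this process step by step starting from Gaussian noise.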
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio.
SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement
This paper proposes SEFGAN, a Deep Neural Network (DNN) combining maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE).
tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models
Contrastive Language-Audio Pretraining (CLAP) has become crucially important in the field of audio and speech processing.