Audio Generation
60 papers with code • 3 benchmarks • 8 datasets
Audio generation (synthesis) is the task of generating raw audio, such as speech or music.
(Image credit: MelNet)
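Concretely, "raw audio" is just a time series of amplitude samples at a fixed sample rate, which is the representation these models generate. As a minimal, model-free illustration (a plain sine tone using only the Python standard library, not the method of any paper listed below), the sketch synthesizes one second of audio and writes it as a 16-bit mono WAV file:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second, a common rate for speech models
FREQ_HZ = 440.0      # pitch of the test tone
DURATION_S = 1.0

# Raw audio: a list of amplitude samples in [-1.0, 1.0].
num_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
    for n in range(num_samples)
]

# Quantize the floats to 16-bit signed integers (little-endian PCM).
pcm = struct.pack("<%dh" % num_samples, *(int(s * 32767) for s in samples))

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm)
```

A neural audio generator replaces the sine formula with a learned model, but its output is the same kind of sample sequence.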
Latest papers
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
We also validate the efficiency of Language-Codec on downstream speech language models.
Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls
We apply this method to fine-tune MusicGen, a leading autoregressive music generation model.
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment.
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora
Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of transcribed CS resources.
Invisible Watermarking for Audio Generation Diffusion Models
Diffusion models have gained prominence in the image domain for their capabilities in data generation and transformation, achieving state-of-the-art performance in various tasks in both image and audio domains.
Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Diffusion models power the vast majority of text-to-audio (TTA) generation methods.
An Initial Exploration: Learning to Generate Realistic Audio for Silent Video
Generating realistic audio effects for movies and other media is a challenging task that is accomplished today primarily through physical techniques known as Foley art.
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
Any audio can be translated into a "language of audio" (LOA) representation based on AudioMAE, a self-supervised pre-trained representation learning model.
MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation.