Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

2 Jan 2024 · Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AI-generated content (AIGC). Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous T2I studies recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, whereas a comparable evaluation is lacking in prior TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io.
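
The core idea the abstract describes, reusing a T2I latent-diffusion framework for audio, can be summarized as: encode the prompt with a frozen text encoder, denoise a latent mel-spectrogram "image" with a cross-attention-conditioned UNet, then decode the latent and vocode it to a waveform. Below is a minimal PyTorch sketch of that structure; the toy UNet, tensor shapes, schedule, and update rule are illustrative assumptions, not Auffusion's actual architecture.

```python
# Minimal sketch of a T2I-style latent-diffusion TTA pipeline.
# Everything here (module names, shapes, the simplified update step)
# is a stand-in to show the data flow, not the paper's implementation.
import torch
import torch.nn as nn

class ToyCrossAttnUNet(nn.Module):
    """Stand-in for the T2I UNet: a single cross-attention block that
    injects text conditioning into the noisy spectrogram latent."""
    def __init__(self, latent_dim=64, text_dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_t, t, text_emb):
        # z_t: (B, tokens, latent_dim) flattened spectrogram latent.
        # Timestep conditioning is omitted for brevity.
        h, _ = self.attn(z_t, text_emb, text_emb)
        return self.out(h)  # predicted noise

unet = ToyCrossAttnUNet()
text_emb = torch.randn(1, 77, 768)   # frozen text-encoder output (dummy)
z = torch.randn(1, 256, 64)          # start from Gaussian noise

# DDPM-style reverse process; the 0.02 step is a placeholder for the
# proper noise-schedule update.
with torch.no_grad():
    for t in reversed(range(50)):
        eps = unet(z, t, text_emb)
        z = z - 0.02 * eps

# In the real system: z -> VAE decoder -> mel spectrogram -> vocoder -> wav.
```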

Datasets

AudioCaps

Results from the Paper


Task              Dataset    Model           Metric  Value   Global Rank
Audio Generation  AudioCaps  Auffusion       FAD     1.63    #5
Audio Generation  AudioCaps  Auffusion       FD      21.99   #6
Audio Generation  AudioCaps  Auffusion-Full  FAD     1.76    #6
Audio Generation  AudioCaps  Auffusion-Full  FD      23.08   #8
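
For context on the metrics above: FAD (Fréchet Audio Distance) and FD (Fréchet Distance) both measure how closely the embedding distribution of generated audio matches that of reference audio (lower is better); they differ mainly in the embedding model used (commonly VGGish for FAD and PANNs for FD in TTA evaluations). A minimal sketch of the underlying Fréchet distance, assuming embeddings are already computed; the function name and dummy data are illustrative.

```python
# Fréchet distance between Gaussians fit to two embedding sets,
# the quantity underlying both FAD and FD (embedding extraction with a
# pretrained audio model is assumed to have happened upstream).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_ref, emb_gen):
    mu_r, mu_g = emb_ref.mean(0), emb_gen.mean(0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Usage with dummy 128-dim embeddings (real FAD uses model features):
ref = np.random.randn(200, 128)
gen = np.random.randn(200, 128)
print(frechet_distance(ref, gen))
```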
