EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

31 Jan 2024 · Jaeyeon Kim, JaeYoon Jung, Jinjoo Lee, Sang Hoon Woo

We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves the acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP. An online demo is available at https://huggingface.co/spaces/enclap-team/enclap.
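The abstract describes the architecture only at a high level. The snippet below is a minimal, hypothetical sketch of how the two acoustic representations could be combined and fed to BART: discrete EnCodec indices are embedded per timestep, the clip-level CLAP embedding is projected and prepended, and the resulting sequence is passed to the BART encoder in place of token embeddings. The tensor shapes, the single-codebook simplification, and the projection layers are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast

# Backbone language model (the paper uses a pretrained BART).
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
d_model = bart.config.d_model  # 768 for bart-base

# Hypothetical acoustic inputs for one clip (shapes assumed for illustration):
#   codes: (batch, T) discrete EnCodec indices from a single codebook
#   clap:  (batch, 512) clip-level CLAP audio embedding
codes = torch.randint(0, 1024, (1, 256))
clap = torch.randn(1, 512)

codec_embed = torch.nn.Embedding(1024, d_model)  # embeds EnCodec indices
clap_proj = torch.nn.Linear(512, d_model)        # projects the CLAP vector

# Prepend the projected CLAP embedding to the embedded codec sequence and
# feed the result to BART's encoder as input embeddings.
inputs_embeds = torch.cat(
    [clap_proj(clap).unsqueeze(1), codec_embed(codes)], dim=1
)

labels = tokenizer(
    "a dog barks while a siren wails", return_tensors="pt"
).input_ids
out = bart(inputs_embeds=inputs_embeds, labels=labels)
out.loss.backward()  # caption cross-entropy; masked codec modeling would add
                     # an auxiliary loss for predicting masked EnCodec indices
```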


Results from the Paper


| Task             | Dataset   | Model        | Metric | Value  | Global Rank |
|------------------|-----------|--------------|--------|--------|-------------|
| Audio captioning | AudioCaps | EnCLAP-large | CIDEr  | 0.8029 | #2          |
| Audio captioning | AudioCaps | EnCLAP-large | SPIDEr | 0.4954 | #1          |
| Audio captioning | AudioCaps | EnCLAP-large | SPICE  | 0.1879 | #1          |
| Audio captioning | AudioCaps | EnCLAP-large | METEOR | 0.2554 | #1          |
| Audio captioning | AudioCaps | EnCLAP-base  | CIDEr  | 0.7795 | #4          |
| Audio captioning | AudioCaps | EnCLAP-base  | SPIDEr | 0.4829 | #3          |
| Audio captioning | AudioCaps | EnCLAP-base  | SPICE  | 0.1863 | #2          |
| Audio captioning | AudioCaps | EnCLAP-base  | METEOR | 0.2473 | #3          |
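As a sanity check on the table, SPIDEr is by definition the arithmetic mean of CIDEr and SPICE, and the reported values are consistent with this, e.g. for EnCLAP-large:

$$\mathrm{SPIDEr} = \tfrac{1}{2}\left(\mathrm{CIDEr} + \mathrm{SPICE}\right), \qquad \tfrac{1}{2}(0.8029 + 0.1879) = 0.4954.$$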
