Audio captioning
40 papers with code • 2 benchmarks • 4 datasets
Audio captioning is the task of describing audio content in natural language. The general approach is an encoder-decoder architecture: an audio encoder (e.g., PANNs, CAV-MAE) encodes the audio, and a text decoder (e.g., a Transformer) generates the caption. Caption quality is typically judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), but these are not well-suited to audio captions and correlate poorly with human judgments. Metrics based on pretrained language models, such as Sentence-BERT, have been proposed as alternatives.
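As a toy illustration of why n-gram metrics can fall short here, consider clipped unigram precision, the 1-gram component of BLEU. Two captions describing the same sound in different words score poorly, even though a human would judge them equivalent. This is a minimal sketch, not a full BLEU implementation (no brevity penalty, single reference, unigrams only):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the 1-gram component of BLEU.

    Each candidate word counts as a match at most as many times
    as it appears in the reference (clipping).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

# An exact match scores 1.0, as expected.
print(unigram_precision("a dog barks loudly", "a dog barks loudly"))   # 1.0

# A semantically equivalent paraphrase is penalized: only "a" and
# "dog" overlap, so the score drops to 2/4 = 0.5.
print(unigram_precision("a dog barks loudly", "loud barking from a dog"))  # 0.5
```

Surface-form mismatches like this are one motivation for embedding-based metrics such as Sentence-BERT, which compare captions in a learned semantic space rather than by n-gram overlap.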
Libraries
Use these libraries to find Audio captioning models and implementations.

Most implemented papers
Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach
In our scenario, a pre-optimized AAC method is applied to unseen general audio signals and can update its parameters to adapt to the new information, given a new reference caption.
Audio Captioning Transformer
In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning
Automated audio captioning aims to use natural language to describe the content of audio data.
Can Audio Captions Be Evaluated with Image Caption Metrics?
Current metrics are found to correlate poorly with human annotations on these datasets.
Automated Audio Captioning by Fine-tuning BART with AudioSet Tags
Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language.
Audio Retrieval with Natural Language Queries: A Benchmark Study
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Local Information Assisted Attention-free Decoder for Audio Captioning
Although this method effectively captures global information within audio data via the self-attention mechanism, it may miss events of short duration due to its limited ability to capture local information in an audio signal, leading to inaccurate caption predictions.
Caption Feature Space Regularization for Audio Captioning
To eliminate this negative effect, in this paper we propose a two-stage framework for audio captioning: (i) in the first stage, we use contrastive learning to construct a proxy feature space that reduces the distances between captions correlated with the same audio, and (ii) in the second stage, the proxy feature space serves as additional supervision, encouraging the model to be optimized in a direction that benefits all the correlated captions.
Multimodal Knowledge Alignment with Reinforcement Learning
Large language models readily adapt to novel settings, even without task-specific training data.
Language-based Audio Retrieval Task in DCASE 2022 Challenge
Language-based audio retrieval is a task in which natural language captions are used as queries to retrieve audio signals from a dataset.