AudioCaps
Most implemented papers
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Audio captioning aims to generate text descriptions of audio clips.
Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates
Several complementary metrics, such as BLEU, CIDEr, SPICE, and SPIDEr, are used to compare a single automatic caption to one or several reference captions produced by human annotators; SPIDEr-max instead considers several caption candidates for each clip.
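As a minimal illustration of the SPIDEr-max idea, score every candidate caption and keep the best score. The `token_f1` function below is only a toy placeholder; a real implementation would plug in an actual SPIDEr scorer.

```python
def token_f1(candidate, reference):
    # Toy stand-in for SPIDEr (placeholder only): token-level F1 overlap.
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def spider_max(candidates, references, score_fn=token_f1):
    # SPIDEr-max: score every caption candidate against the references and
    # keep the best one, instead of judging only the single top caption.
    return max(
        sum(score_fn(c, r) for r in references) / len(references)
        for c in candidates
    )
```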
Accommodating Audio Modality in CLIP for Multimodal Processing
In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
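One plausible way to realize this, sketched here with illustrative components rather than the paper's actual architecture: train a small audio encoder to project into CLIP's joint embedding space, aligned to frozen CLIP text embeddings with a symmetric contrastive loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioBranch(nn.Module):
    # Assumed sizes: 64 mel bands in, CLIP's 512-dim joint space out.
    def __init__(self, n_mels=64, clip_dim=512):
        super().__init__()
        self.encoder = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, clip_dim)  # map into CLIP's space

    def forward(self, mel):                   # mel: (batch, time, n_mels)
        _, h = self.encoder(mel)
        return F.normalize(self.proj(h[-1]), dim=-1)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: matching audio/text pairs sit on the diagonal.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```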
Target Sound Extraction with Variable Cross-modality Clues
Automatic target sound extraction (TSE) is a machine learning approach that mimics the human auditory ability to attend to a sound source of interest within a mixture of sources.
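A minimal sketch of the general TSE pattern, with assumed module sizes: estimate a time-frequency mask for the mixture, conditioned (here via FiLM modulation, one common choice) on an embedding of the clue.

```python
import torch.nn as nn

class ClueConditionedMasker(nn.Module):
    # Illustrative sizes: 257 frequency bins, 128-dim clue embedding.
    def __init__(self, n_freq=257, clue_dim=128):
        super().__init__()
        self.film = nn.Linear(clue_dim, 2 * n_freq)   # FiLM scale & shift
        self.net = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, clue_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mix;
        # clue_emb: (batch, clue_dim) embedding of the cross-modality clue.
        scale, shift = self.film(clue_emb).chunk(2, dim=-1)
        x = mix_spec * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.net(x)                            # values in (0, 1)
        return mix_spec * mask                        # estimated target source
```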
Prefix tuning for automated audio captioning
Audio captioning aims to generate text descriptions from environmental sounds.
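The excerpt above only defines the task; the technique named in the title, prefix tuning, can be sketched roughly as follows. A small trainable mapper turns a frozen audio encoder's embedding into "prefix" vectors prepended to a frozen language model's input embeddings; all sizes are illustrative assumptions.

```python
import torch.nn as nn

class PrefixMapper(nn.Module):
    # Assumed dims: 512-dim audio embedding, 768-dim LM embedding space.
    def __init__(self, audio_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, lm_dim * prefix_len), nn.Tanh())

    def forward(self, audio_emb):             # audio_emb: (batch, audio_dim)
        prefix = self.mlp(audio_emb)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Usage idea: torch.cat([mapper(audio_emb), caption_token_embeddings], dim=1)
# feeds the frozen LM; the caption loss backpropagates only into the mapper.
```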
Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks.
RECAP: Retrieval-Augmented Audio Captioning
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio clip and on similar captions retrieved from a datastore.
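The retrieval step can be sketched as a simple nearest-neighbor lookup over a datastore of caption embeddings; the CLAP-style embeddings below are placeholders, and the prompt format is an assumption.

```python
import numpy as np

def retrieve_captions(audio_emb, datastore_embs, datastore_caps, k=4):
    # Cosine similarity between the query audio embedding and every stored
    # caption embedding; both are assumed to live in a shared CLAP-like space.
    q = audio_emb / np.linalg.norm(audio_emb)
    d = datastore_embs / np.linalg.norm(datastore_embs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(sims)[::-1][:k]          # indices of the k best matches
    return [datastore_caps[i] for i in top]

# The retrieved captions can then be concatenated into the decoder's context,
# e.g. "Similar sounds were described as: ... Describe this audio:".
```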
Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Diffusion models power the vast majority of text-to-audio (TTA) generation methods.
Weakly-supervised Automated Audio Captioning via text only training
Our approach leverages the similarity between audio and text embeddings in CLAP.
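A rough sketch of the text-only training idea: because CLAP places audio and text in a shared embedding space, a caption decoder can be trained on text embeddings alone and fed audio embeddings at inference. The noise injection below is a common trick for bridging the residual modality gap, stated here as an assumption rather than this paper's exact recipe.

```python
import torch

def training_input(clap_text_emb, noise_std=0.015):
    # Train time: the "audio" conditioning is simulated from the caption's
    # own CLAP text embedding, perturbed with a little Gaussian noise.
    return clap_text_emb + noise_std * torch.randn_like(clap_text_emb)

def inference_input(clap_audio_emb):
    # Test time: swap in the real CLAP audio embedding; since CLAP aligns
    # the two modalities, the decoder treats it like the embeddings it saw
    # during text-only training.
    return clap_audio_emb
```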
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
We also introduce a new training objective called masked codec modeling that improves the acoustic awareness of the pretrained language model.
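A hedged sketch of what a masked-codec-modeling objective can look like: discrete neural-codec tokens are randomly replaced by a mask id and the model is trained to recover them, BERT-style. The mask ratio and the -100 ignore-index convention below are assumptions, not the paper's stated values.

```python
import torch

def mask_codec_tokens(codec_tokens, mask_id, mask_ratio=0.15):
    # codec_tokens: (batch, seq_len) integer ids from a neural audio codec.
    mask = torch.rand_like(codec_tokens, dtype=torch.float) < mask_ratio
    corrupted = codec_tokens.masked_fill(mask, mask_id)
    # Compute the loss only at masked positions: -100 is PyTorch's
    # conventional ignore_index for cross-entropy.
    labels = codec_tokens.masked_fill(~mask, -100)
    return corrupted, labels
```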