AudioCaps

20 papers with code • 0 benchmarks • 0 datasets

AudioCaps is a dataset of roughly 46,000 ten-second audio clips drawn from AudioSet, each paired with human-written captions. It is widely used to train and evaluate automated audio captioning and, more recently, text-to-audio generation systems.

Most implemented papers

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

liuxubo717/v-act 28 Oct 2022

Audio captioning aims to generate text descriptions of audio clips.

Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates

labbeti/aac-metrics 14 Nov 2022

For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption against one or several reference captions produced by human annotators.
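The SPIDEr-max idea proposed here is to score several caption candidates (e.g. from beam search) rather than only the top-1 output. A minimal sketch of that idea follows; `spider_score` is a hypothetical stand-in for a real SPIDEr implementation such as the one shipped in labbeti/aac-metrics, not its actual API.

```python
# Sketch of SPIDEr-max: score every candidate caption against the references
# and keep the best SPIDEr value instead of only scoring the top-1 candidate.
from typing import Callable, List


def spider_max(
    candidates: List[str],
    references: List[str],
    spider_score: Callable[[str, List[str]], float],  # hypothetical scorer
) -> float:
    """Return the best SPIDEr score over all candidate captions."""
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    return max(spider_score(cand, references) for cand in candidates)
```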

Accommodating Audio Modality in CLIP for Multimodal Processing

ludanruan/clip4vla 12 Mar 2023

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

Target Sound Extraction with Variable Cross-modality Clues

lichenda/multi-clue-tse-data 15 Mar 2023

Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources.

Prefix tuning for automated audio captioning

MinkyuKim26/Prefix_AAC_ICASSP2023 30 Mar 2023

Audio captioning aims to generate text descriptions from environmental sounds.

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

declare-lab/tango 24 Apr 2023

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, which have significantly improved zero- and few-shot performance on many natural language processing (NLP) tasks.

RECAP: Retrieval-Augmented Audio Captioning

sreyan88/recap 18 Sep 2023

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio clip and on captions retrieved from a datastore that are similar to that audio.
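A hedged sketch of the retrieval step such a system relies on, assuming precomputed embeddings in a shared audio-text space; the encoder and datastore layout are illustrative placeholders, not the RECAP implementation.

```python
# Retrieve the k captions most similar to an audio clip by cosine similarity
# in a shared audio-text embedding space (e.g. CLAP-style embeddings).
import numpy as np


def retrieve_similar_captions(
    audio_embedding: np.ndarray,     # shape (d,), embedding of the input clip
    caption_embeddings: np.ndarray,  # shape (n, d), precomputed datastore
    captions: list[str],             # the n captions backing the datastore
    k: int = 4,
) -> list[str]:
    """Return the k captions closest to the audio embedding (cosine similarity)."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    c = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    scores = c @ a                       # cosine similarity of each caption to the audio
    top = np.argsort(-scores)[:k]        # indices of the k highest-scoring captions
    return [captions[i] for i in top]
```

The retrieved captions can then be passed to the caption generator as additional conditioning alongside the audio itself.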

Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Bai-YT/ConsistencyTTA 19 Sep 2023

Diffusion models power the vast majority of text-to-audio (TTA) generation methods.

Weakly-supervised Automated Audio Captioning via text only training

zelaki/wsac 21 Sep 2023

Our approach leverages the similarity between audio and text embeddings in CLAP.
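Because CLAP maps audio and text into a roughly shared embedding space, a caption decoder can be trained from text alone and fed an audio embedding at inference time. The sketch below illustrates that swap under assumed, hypothetical encoder and decoder objects; it is not the wsac or CLAP API.

```python
# Text-only training sketch: condition the decoder on the CLAP *text* embedding
# of the caption during training, then substitute the CLAP *audio* embedding of
# the clip at inference, relying on the shared embedding space.
import torch


def training_step(decoder, clap_text_encoder, caption_tokens, caption_text):
    """Train the decoder to generate a caption from its own CLAP text embedding."""
    with torch.no_grad():
        cond = clap_text_encoder(caption_text)        # (batch, d); no audio needed
    logits = decoder(cond, caption_tokens[:, :-1])    # teacher forcing
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1),
    )


def caption_audio(decoder, clap_audio_encoder, audio_waveform):
    """At test time, swap in the CLAP audio embedding of the clip to caption."""
    cond = clap_audio_encoder(audio_waveform)         # lives in the same space as text embeddings
    return decoder.generate(cond)                     # hypothetical generate() helper
```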

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

jaeyeonkim99/enclap 31 Jan 2024

We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model.
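A hedged sketch of what a masked-codec-modeling objective can look like: randomly mask a fraction of neural audio codec tokens and train the language model to recover them. The mask ratio and token ids below are illustrative assumptions, not the EnCLAP settings.

```python
# Masked codec modeling sketch: corrupt codec token sequences with a mask token
# and produce labels that a cross-entropy loss will only apply at masked positions.
import torch


def mask_codec_tokens(codec_tokens: torch.Tensor, mask_token_id: int, p: float = 0.15):
    """Return (masked_inputs, labels); labels are -100 at unmasked positions."""
    mask = torch.rand_like(codec_tokens, dtype=torch.float) < p   # positions to mask
    inputs = codec_tokens.masked_fill(mask, mask_token_id)        # corrupted input
    labels = codec_tokens.masked_fill(~mask, -100)                # -100 is ignored by cross-entropy
    return inputs, labels
```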