
20 papers with code • 0 benchmarks • 0 datasets

Most implemented papers

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

haoheliu/AudioLDM 29 Jan 2023

By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency.

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

OFA-Sys/ONE-PEACE 18 May 2023

In this work, we explore a scalable way for building a general representation model toward unlimited modalities.

Audio Retrieval with Natural Language Queries

oncescuandreea/audio-retrieval 5 May 2021

We consider the task of retrieving audio using free-form natural language queries.

Audio Captioning Transformer

XinhaoMei/ACT 21 Jul 2021

In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.

Can Audio Captions Be Evaluated with Image Caption Metrics?

blmoistawinde/fense 10 Oct 2021

Current metrics are found in poor correlation with human annotations on these datasets.


felixgontier/dcase2021aac DCASE workshop 2021

Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language.

Audio Retrieval with Natural Language Queries: A Benchmark Study

akoepke/audio-retrieval-benchmark 17 Dec 2021

Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.

Separate What You Describe: Language-Queried Audio Source Separation

liuxubo717/lass 28 Mar 2022

In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").

On Metric Learning for Audio-Text Cross-Modal Retrieval

XinhaoMei/audio-text_retrieval 29 Mar 2022

We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.

Audio Retrieval with WavText5K and CLAP Training

microsoft/wavtext5k 28 Sep 2022

In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval.