Zero-Shot Environment Sound Classification
5 papers with code • 1 benchmark • 1 dataset
Most implemented papers
AudioCLIP: Extending CLIP to Image, Text and Audio
AudioCLIP achieves new state-of-the-art results on the Environmental Sound Classification (ESC) task, outperforming other approaches with accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets.
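CLIP-style models like AudioCLIP enable zero-shot classification by embedding both text prompts and audio clips into a shared space and picking the class whose text embedding is most similar to the audio. A minimal sketch of that decision rule, using random placeholder embeddings in place of real pretrained encoders (the `embed_text` function and 512-dim space here are illustrative assumptions, not AudioCLIP's actual API):

```python
import numpy as np

# Placeholder for a pretrained text encoder: in a real setup, AudioCLIP's
# text and audio encoders map prompts and clips into one shared space.
rng = np.random.default_rng(0)

def embed_text(prompts):
    # Illustrative stand-in: one fixed random unit vector per prompt.
    vecs = rng.normal(size=(len(prompts), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def classify_zero_shot(audio_emb, class_embs, class_names):
    """Return the class whose text embedding is most cosine-similar
    to the audio embedding."""
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    sims = class_embs @ audio_emb  # cosine similarities (rows are unit vectors)
    return class_names[int(np.argmax(sims))]

classes = ["dog bark", "siren", "rain"]
class_embs = embed_text([f"a sound of {c}" for c in classes])

# Simulate an audio clip whose embedding lies close to the "siren" prompt:
audio_emb = class_embs[1] + 0.05 * rng.normal(size=512)
print(classify_zero_shot(audio_emb, class_embs, classes))
```

Because the class set is supplied only as text at inference time, the same model can score new sound categories without any retraining, which is what makes the approach zero-shot.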
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose a dataset of Video, Infrared, Depth, and Audio samples paired with their corresponding Language descriptions, which we name VIDAL-10M.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
ImageBind: One Embedding Space To Bind Them All
We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.