TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Classification	AudioSet	BEATs (Audio-only, Single)	Test mAP	0.486	# 13
Audio Classification	AudioSet	BEATs (Audio-only, Ensemble)	Test mAP	0.506	# 5
Audio Classification	Balanced Audio Set	BEATs	Mean AP	38.9	# 1
Audio Classification	ESC-50	BEATs	Top-1 Accuracy	98.1	# 3
Audio Classification	ESC-50	BEATs	PRE-TRAINING DATASET	AudioSet	# 1
Audio Classification	ESC-50	BEATs	Accuracy (5-fold)	98.1	# 3

BEATs: Audio Pre-Training with Acoustic Tokenizers

18 Dec 2022 · Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Furu Wei

Self-supervised learning (SSL) has grown rapidly in the language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, state-of-the-art audio SSL models still employ a reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract high-level audio semantics and discard redundant details, as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, because audio is continuous and, unlike speech, has no phoneme sequences available. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics and that our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even significantly outperforming previous models that use more training data and model parameters. Specifically, we set a new state-of-the-art mAP of 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
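
The first iteration described in the abstract (a random-projection tokenizer that supplies discrete labels for masked prediction) can be illustrated with a small sketch. This is not the released BEATs code: the function names, dimensions, mask ratio, and the nearest-neighbor lookup in a fixed random codebook are assumptions made for illustration, and the SSL model and the later distillation step are left out entirely.

```python
import random


def make_random_tokenizer(in_dim, proj_dim, codebook_size, seed=0):
    """Build a frozen tokenizer from a random projection and a random codebook.

    Assumption: labels come from a nearest-neighbor lookup in the codebook.
    Nothing here is trained; both matrices stay fixed after creation.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                  for _ in range(proj_dim)]
    codebook = [[rng.gauss(0.0, 1.0) for _ in range(proj_dim)]
                for _ in range(codebook_size)]

    def tokenize(patches):
        labels = []
        for patch in patches:
            # Project the patch feature vector into the codebook space.
            projected = [sum(w * x for w, x in zip(row, patch))
                         for row in projection]
            # The discrete label is the index of the closest codebook vector.
            distances = [sum((p - c) ** 2 for p, c in zip(projected, code))
                         for code in codebook]
            labels.append(distances.index(min(distances)))
        return labels

    return tokenize


def mask_positions(num_patches, mask_ratio, seed=0):
    """Pick a random subset of patch positions to mask (hypothetical helper)."""
    rng = random.Random(seed)
    num_masked = max(1, int(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_targets(patches, tokenize, masked):
    """Return {position: label} for the masked positions only.

    An SSL model would see the unmasked patches and be trained to
    predict these labels at the masked positions.
    """
    labels = tokenize(patches)
    return {i: labels[i] for i in masked}
```

In use, one would call `make_random_tokenizer` once, pick positions with `mask_positions`, and pass the output of `masked_targets` to a classification loss over the codebook indices. In later iterations the abstract replaces the random tokenizer with one distilled from the trained SSL model, which this sketch does not attempt.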

Code

microsoft/unilm official

18,340

Yui010206/CREMA

Tasks

Audio Classification

Self-Supervised Learning

Datasets

AudioSet

ESC-50

Results from the Paper

Ranked #1 on Audio Classification on Balanced Audio Set

Methods

Dense Connections • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer
