TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Audio Tagging	AudioSet	DyMN-L (Audio-Only, Single)	mean average precision	0.490	# 4
Audio Classification	AudioSet	DyMN-L (Audio-Only, Single)	Test mAP	0.490	# 11
Audio Classification	ESC-50	DyMN-L	Top-1 Accuracy	97.4	# 5
Audio Classification	ESC-50	DyMN-L	PRE-TRAINING DATASET	AudioSet	# 1
Audio Classification	ESC-50	DyMN-L	Accuracy (5-fold)	97.4	# 5
Audio Classification	FSD50K	MN	mAP	65.6	# 2
Audio Classification	FSD50K	DyMN-L	mAP	65.5	# 4
Instrument Recognition	OpenMIC-2018	DyMN-L	mean average precision	0.855	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dynamic-convolutional-neural-networks-as/instrument-recognition-on-openmic-2018)](https://paperswithcode.com/sota/instrument-recognition-on-openmic-2018?p=dynamic-convolutional-neural-networks-as)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dynamic-convolutional-neural-networks-as/audio-classification-on-fsd50k)](https://paperswithcode.com/sota/audio-classification-on-fsd50k?p=dynamic-convolutional-neural-networks-as)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dynamic-convolutional-neural-networks-as/audio-tagging-on-audioset)](https://paperswithcode.com/sota/audio-tagging-on-audioset?p=dynamic-convolutional-neural-networks-as)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dynamic-convolutional-neural-networks-as/audio-classification-on-esc-50)](https://paperswithcode.com/sota/audio-classification-on-esc-50?p=dynamic-convolutional-neural-networks-as)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dynamic-convolutional-neural-networks-as/audio-classification-on-audioset)](https://paperswithcode.com/sota/audio-classification-on-audioset?p=dynamic-convolutional-neural-networks-as)`

Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

24 Oct 2023 · Florian Schmid, Khaled Koutini, Gerhard Widmer

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms. We show that these dynamic CNNs outperform traditional efficient CNNs, in terms of the performance-complexity trade-off and parameter efficiency, at the task of audio tagging on the large-scale AudioSet. Our experiments further indicate that the introduced dynamic CNNs achieve better performance on downstream tasks and scale up well, attaining Transformer performance and even outperforming them on AudioSet and several downstream tasks.
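To make the term "dynamic convolution" in the abstract concrete, here is a minimal pure-Python sketch of the general idea as it is commonly formulated: several candidate kernels are mixed with input-dependent softmax weights, and the mixed kernel is then applied as an ordinary convolution. This is an illustration under stated assumptions, not the authors' DyMN block. The 1-D setting, the function names, the linear context layer `w_ctx`, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(x, kernels, w_ctx, temperature=1.0):
    """Toy dynamic 1-D convolution (illustrative only).

    x       : C_in channels, each a list of T samples
    kernels : K candidate kernels, indexed kernels[k][c_out][c_in][tap]
    w_ctx   : K x C_in weights of a hypothetical linear context layer
    Returns (output, alpha), where alpha are the kernel mixing weights.
    """
    c_in, t_len = len(x), len(x[0])

    # 1) Summarize the input by global average pooling over time.
    context = [sum(channel) / t_len for channel in x]

    # 2) Turn the context into one attention weight per candidate kernel.
    scores = [sum(w * c for w, c in zip(row, context)) for row in w_ctx]
    alpha = softmax(scores, temperature)

    # 3) Mix the K kernels into a single input-dependent kernel.
    c_out, taps = len(kernels[0]), len(kernels[0][0][0])
    mixed = [[[sum(a * k[o][i][j] for a, k in zip(alpha, kernels))
               for j in range(taps)]
              for i in range(c_in)]
             for o in range(c_out)]

    # 4) Apply the mixed kernel as a "valid" cross-correlation
    #    (the operation deep-learning libraries call convolution).
    t_out = t_len - taps + 1
    out = [[sum(mixed[o][i][j] * x[i][t + j]
                for i in range(c_in) for j in range(taps))
            for t in range(t_out)]
           for o in range(c_out)]
    return out, alpha


# Toy input: 2 channels, 4 time steps; 2 candidate kernels, 1 output channel.
x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [
    [[[1.0, 0.0], [0.0, 0.0]]],  # kernel 0 copies the current sample of channel 0
    [[[0.0, 1.0], [0.0, 0.0]]],  # kernel 1 copies the next sample of channel 0
]
w_ctx = [[0.0, 0.0], [0.0, 0.0]]  # zero weights give a uniform kernel mix
out, alpha = dynamic_conv1d(x, kernels, w_ctx)
print(alpha)  # [0.5, 0.5]
print(out)    # [[1.5, 2.5, 3.5]]
```

With zero context weights the two kernels are averaged, so the output is a two-tap moving average of the first channel. Non-zero `w_ctx` values shift the mix toward one kernel depending on the input, which is the sense in which the convolution is "dynamic".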

Code

fschmid56/efficientat official

181

Tasks

Audio Classification

Audio Tagging

Instrument Recognition

Knowledge Distillation

Datasets

ImageNet

AudioSet

ESC-50

FSD50K

OpenMIC-2018

Results from the Paper

Ranked #1 on Instrument Recognition on OpenMIC-2018 (using extra training data)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Audio Tagging	AudioSet	DyMN-L (Audio-Only, Single)	mean average precision	0.490	# 4
Audio Classification	AudioSet	DyMN-L (Audio-Only, Single)	Test mAP	0.490	# 11
Audio Classification	ESC-50	DyMN-L	Top-1 Accuracy	97.4	# 5
Audio Classification	ESC-50	DyMN-L	PRE-TRAINING DATASET	AudioSet	# 1
Audio Classification	ESC-50	DyMN-L	Accuracy (5-fold)	97.4	# 5
Audio Classification	FSD50K	MN	mAP	65.6	# 2
Audio Classification	FSD50K	DyMN-L	mAP	65.5	# 4
Instrument Recognition	OpenMIC-2018	DyMN-L	mean average precision	0.855	# 1
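
Several rows above report mean average precision (mAP). The AudioSet and OpenMIC-2018 values appear to be fractions (0.490, 0.855), while the FSD50K values appear to be percentages (65.6, 65.5). As a reading aid, the following sketch computes mAP for a multi-label tagging task, assuming the standard non-interpolated definition: per-class average precision, averaged over classes. The exact evaluation protocol of each benchmark is not stated on this page, and the function names and toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Non-interpolated AP for one class.

    labels : list of 0/1 ground-truth flags, one per clip
    scores : list of predicted scores, one per clip
    Ties in scores are broken by input order (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    if hits == 0:
        return None  # AP is undefined for a class without positives
    return sum(precisions) / hits


def mean_average_precision(label_rows, score_rows):
    """Mean of per-class AP; rows are clips, columns are classes.

    Classes without any positive clip are skipped.
    """
    n_classes = len(label_rows[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in label_rows],
                               [row[c] for row in score_rows])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy data: 4 clips, 2 classes.
label_rows = [[1, 0], [0, 1], [1, 0], [0, 0]]
score_rows = [[0.9, 0.2], [0.8, 0.6], [0.7, 0.3], [0.1, 0.1]]
print(round(mean_average_precision(label_rows, score_rows), 3))  # 0.917
```

In the toy data, class 0 has AP (1/1 + 2/3) / 2 ≈ 0.833 and class 1 has AP 1.0, so the mean is about 0.917; multiplying by 100 gives the percentage form.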

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Knowledge Distillation • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
