End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network

25 Apr 2022 · Avi Gazneli, Gadi Zimerman, Tal Ridnik, Gilad Sharir, Asaf Noy

While efficient architectures and a plethora of augmentations for end-to-end image classification tasks have been suggested and heavily investigated, state-of-the-art techniques for audio classification still rely on numerous representations of the audio signal together with large architectures, fine-tuned from large datasets. By utilizing the inherent lightweight nature of audio and novel audio augmentations, we were able to present an efficient end-to-end network with strong generalization ability. Experiments on a variety of sound classification sets demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings. Public code is available at: https://github.com/Alibaba-MIIL/AudioClassfication


Results from the Paper


Task                                Dataset                 Model            Metric Name                   Metric Value  Global Rank
----------------------------------  ----------------------  ---------------  ----------------------------  ------------  -----------
Audio Classification                AudioSet                EAT-M            Test mAP                      0.426         # 32
Audio Classification                AudioSet                EAT-S            Test mAP                      0.405         # 34
Audio Classification                ESC-50                  EAT-S            Top-1 Accuracy                95.25         # 10
Audio Classification                ESC-50                  EAT-S            Pre-training dataset          AudioSet      # 1
Audio Classification                ESC-50                  EAT-S            Accuracy (5-fold)             95.25         # 10
Audio Classification                ESC-50                  EAT-S (scratch)  Top-1 Accuracy                92.15         # 12
Audio Classification                ESC-50                  EAT-S (scratch)  Accuracy (5-fold)             92.15         # 12
Audio Classification                ESC-50                  EAT-M            Top-1 Accuracy                96.3          # 7
Audio Classification                ESC-50                  EAT-M            Pre-training dataset          AudioSet      # 1
Audio Classification                ESC-50                  EAT-M            Accuracy (5-fold)             96.3          # 7
Keyword Spotting                    Google Speech Commands  EAT-S            Google Speech Commands V2 35  98.15         # 3
Environmental Sound Classification  UrbanSound8K            EAT-S            Accuracy (10-fold)            88.1          # 3
Environmental Sound Classification  UrbanSound8K            EAT-S (scratch)  Accuracy (10-fold)            85.5          # 4
Environmental Sound Classification  UrbanSound8K            EAT-M            Accuracy (10-fold)            90            # 2
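
The metric names above (Test mAP, k-fold accuracy) can be illustrated with a minimal sketch. It assumes the usual definitions: mAP as the unweighted mean over classes of per-class average precision for multi-label predictions, and k-fold accuracy as the mean of the per-fold test accuracies. This page does not describe the paper's exact evaluation protocol, so the function names and conventions below are illustrative assumptions, not the authors' code.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean of precision@k taken at the rank of each positive.

    Assumption: ties in y_score are broken by original order (stable sort).
    """
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = np.asarray(y_true)[order].astype(float)
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # AP is undefined for a class with no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """Macro mAP: average AP over label columns, skipping classes with no positives."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))


def kfold_accuracy(fold_accuracies):
    """k-fold accuracy: mean of the accuracies measured on each held-out fold."""
    return float(np.mean(fold_accuracies))


if __name__ == "__main__":
    # Made-up toy data (4 clips, 2 labels); not results from the paper.
    Y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
    Y_score = np.array([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]])
    print("mAP:", mean_average_precision(Y_true, Y_score))
    print("5-fold accuracy:", kfold_accuracy([0.9, 0.8, 1.0, 0.7, 0.6]))
```

Under these definitions, a multi-label score such as Test mAP is a fraction between 0 and 1, while the fold-averaged accuracies in the table are given in percent.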
