MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations

29 May 2023 · Calum Heggan, Tim Hospedales, Sam Budgett, Mehrdad Yaghoobi

Contrastive self-supervised learning has gained attention for its ability to create high-quality representations from large unlabelled data sets. A key reason that these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariance preferred are not known a priori and vary across downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance on all of them.
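As a rough illustration of the multi-task idea, the sketch below sums a SimCLR-style contrastive term (which pushes two augmented views of the same clip together, encouraging invariance) and a multi-label augmentation-prediction term (which asks the model to recover which augmentations were applied, encouraging variance). This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature, the `weight` parameter, and the plain weighted sum are all hypothetical, and the parameter-efficient adapters mentioned in the abstract are not modelled.

```python
import math


def nt_xent(z_a, z_b, temperature=0.5):
    """SimCLR-style contrastive loss; z_a[i] and z_b[i] are two views of clip i."""
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    z = [normalise(v) for v in z_a + z_b]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same clip
        sims = [
            sum(p * q for p, q in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
        ]
        # Denominator covers every other embedding in the batch except i itself.
        log_denominator = math.log(
            sum(math.exp(sims[j]) for j in range(2 * n) if j != i)
        )
        total += log_denominator - sims[positive]
    return total / (2 * n)


def multilabel_bce(logits, targets):
    """Binary cross-entropy on logits, averaged over clips and augmentation labels."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for logit, target in zip(row_logits, row_targets):
            # Numerically stable form of BCE-with-logits.
            total += (
                max(logit, 0.0)
                - logit * target
                + math.log1p(math.exp(-abs(logit)))
            )
            count += 1
    return total / count


def multitask_loss(z_a, z_b, aug_logits, aug_targets, weight=1.0):
    """Hypothetical combined objective: contrastive term plus weighted prediction term."""
    return nt_xent(z_a, z_b) + weight * multilabel_bce(aug_logits, aug_targets)
```

In this toy form, correctly paired views yield a lower contrastive term than mismatched ones, while the prediction term only falls if the features still carry information about the applied augmentations.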


Results from the Paper


Ranked #1 on Few-Shot Audio Classification on Common Voice (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Few-Shot Audio Classification | BirdClef 2020 (Pruned) | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 29.49±0.38 | #9 |
| Few-Shot Audio Classification | BirdClef 2020 (Pruned) | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 30.93±0.38 | #8 |
| Few-Shot Audio Classification | BirdClef 2020 (Pruned) | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 21.04±0.35 | #10 |
| Few-Shot Audio Classification | Common Voice | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 35.22±0.40 | #1 |
| Few-Shot Audio Classification | Common Voice | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 23.00±0.42 | #3 |
| Few-Shot Audio Classification | Common Voice | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 33.33±0.38 | #2 |
| Few-Shot Audio Classification | CREMA-D | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 29.10±0.36 | #2 |
| Few-Shot Audio Classification | CREMA-D | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 29.61±0.38 | #1 |
| Few-Shot Audio Classification | CREMA-D | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 21.68±0.33 | #3 |
| Few-Shot Audio Classification | ESC-50 | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 63.40±0.39 | #8 |
| Few-Shot Audio Classification | ESC-50 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 69.53±0.39 | #4 |
| Few-Shot Audio Classification | ESC-50 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 37.76±0.34 | #10 |
| Few-Shot Audio Classification | FSDKaggle2018 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 21.72±0.34 | #10 |
| Few-Shot Audio Classification | FSDKaggle2018 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 39.11±0.41 | #6 |
| Few-Shot Audio Classification | FSDKaggle2018 | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 37.64±0.40 | #8 |
| Few-Shot Audio Classification | NSynth | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 62.52±0.36 | #10 |
| Few-Shot Audio Classification | NSynth | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 71.81±0.39 | #6 |
| Few-Shot Audio Classification | NSynth | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 66.44±0.40 | #8 |
| Few-Shot Audio Classification | Speech Accent Archive | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 26.16±0.34 | #2 |
| Few-Shot Audio Classification | Speech Accent Archive | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 23.08±0.34 | #3 |
| Few-Shot Audio Classification | Speech Accent Archive | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 28.92±0.37 | #1 |
| Few-Shot Audio Classification | Speech Command v2 | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 25.68±0.35 | #1 |
| Few-Shot Audio Classification | Speech Command v2 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 20.08±0.37 | #3 |
| Few-Shot Audio Classification | Speech Command v2 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 23.65±0.34 | #2 |
| Few-Shot Audio Classification | VoxCeleb1 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 21.68±0.40 | #10 |
| Few-Shot Audio Classification | VoxCeleb1 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 33.58±0.39 | #6 |
| Few-Shot Audio Classification | VoxCeleb1 | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 31.18±0.37 | #7 |
| Few-Shot Audio Classification | Watkins Marine Mammal Sounds | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 59.49±0.42 | #1 |
| Few-Shot Audio Classification | Watkins Marine Mammal Sounds | SimCLR (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 52.91±0.41 | #3 |
| Few-Shot Audio Classification | Watkins Marine Mammal Sounds | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy (5-Way-1-Shot) | 28.88±0.39 | #5 |
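For readers unfamiliar with the metric, a 5-way 1-shot figure is typically obtained by scoring many small episodes, each with five classes and one labelled example per class, and then averaging. The sketch below shows one way such a "mean ± interval" number could be produced. It is an illustration only: the page does not state which classifier sits on top of the frozen features or what the ± denotes, so the nearest-centroid classifier, the 95% normal-approximation interval, and the function names are all assumptions.

```python
import math
import statistics


def nearest_centroid_episode(support, query):
    """Score one few-shot episode.

    support: {label: [feature vectors]} (one vector per label for 1-shot).
    query:   [(feature vector, true label), ...].
    Returns the fraction of query items assigned to the correct class.
    """
    centroids = {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in support.items()
    }
    correct = 0
    for features, label in query:
        predicted = min(
            centroids, key=lambda c: math.dist(features, centroids[c])
        )
        correct += predicted == label
    return correct / len(query)


def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of an assumed 95% confidence interval."""
    mean = statistics.fmean(accuracies)
    half_width = (
        1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    )
    return mean, half_width
```

Under these assumptions, a reported value such as 35.22±0.40 would correspond to `mean_and_ci95` applied to per-episode accuracies expressed as percentages.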

Methods