wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

20 Jun 2020 · Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned...
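
The contrastive task mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity between a context vector and candidate quantized vectors, a temperature of 0.1, and distractors drawn from other masked time steps. The function names, the number of distractors, and the plain-list vector representation are all hypothetical choices made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    Assumption: similarity is cosine and logits are scaled by `temperature`.
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


def masked_contrastive_loss(contexts, targets, masked_steps,
                            num_distractors=2, seed=0):
    """Average the loss over masked time steps.

    Assumption: distractors for step t are targets of other masked steps.
    """
    rng = random.Random(seed)
    losses = []
    for t in masked_steps:
        others = [s for s in masked_steps if s != t]
        picks = rng.sample(others, min(num_distractors, len(others)))
        negatives = [targets[s] for s in picks]
        losses.append(contrastive_loss(contexts[t], targets[t], negatives))
    return sum(losses) / len(losses)
```

The loss is close to zero when the context vector points at its own target and far from the distractors, and it grows when a distractor is the better match. In a real model the context and target vectors would come from learned networks rather than hand-written lists.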

TASK                DATASET                 MODEL             METRIC NAME            METRIC VALUE  GLOBAL RANK
Speech Recognition  Libri-Light test-clean  Large-10h-LV-60k  Word Error Rate (WER)  2.6           #1
Speech Recognition  Libri-Light test-other  Large-10h-LV-60k  Word Error Rate (WER)  5.2           #1
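
For readers unfamiliar with the metric in the table, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration, assuming whitespace tokenization and no text normalization. The function name is hypothetical and this is not the scoring script used to produce the numbers above. It returns a fraction; values such as those in the table are presumably percentages, so multiply by 100 to compare.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.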
