End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

19 Nov 2019 · Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling...
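The core semi-supervised recipe is self-training: fit a model on the labeled set, use it to label the unlabeled pool, then retrain on the union. A minimal sketch of that loop, using a hypothetical nearest-centroid classifier on 1-D points as a stand-in for the paper's acoustic model:

```python
# Pseudo-labeling (self-training) sketch. The toy "model" below is a
# nearest-centroid classifier on 1-D points, NOT the paper's ConvNet or
# Transformer acoustic model; only the train/label/retrain loop mirrors
# the method described in the abstract.

def train(points, labels):
    """Fit one centroid per class; returns {label: centroid}."""
    centroids = {}
    for lbl in set(labels):
        xs = [x for x, l in zip(points, labels) if l == lbl]
        centroids[lbl] = sum(xs) / len(xs)
    return centroids

def predict(centroids, x):
    """Assign x to the label of the nearest centroid."""
    return min(centroids, key=lambda lbl: abs(centroids[lbl] - x))

# Step 1: train on the small labeled set (analogous to LibriSpeech).
labeled_x = [0.0, 0.2, 1.0, 1.2]
labeled_y = [0, 0, 1, 1]
model = train(labeled_x, labeled_y)

# Step 2: pseudo-label the unlabeled pool (analogous to LibriVox).
unlabeled_x = [0.1, 0.9, 1.1, -0.1]
pseudo_y = [predict(model, x) for x in unlabeled_x]

# Step 3: retrain on labeled + pseudo-labeled data combined.
model = train(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print(predict(model, 0.05))  # falls in the class-0 region
print(predict(model, 1.05))  # falls in the class-1 region
```

In practice the paper's pipeline also decodes with a language model when generating pseudo-labels and filters them, but the train/label/retrain structure is the same.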


Results from the Paper


#3 best model for Speech Recognition on LibriSpeech test-clean (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Speech Recognition | LibriSpeech test-clean | Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only) | Word Error Rate (WER) | 2.31 | #7 |
| Speech Recognition | LibriSpeech test-clean | Conv + Transformer AM + Pseudo-Labeling (ConvLM with Transformer Rescoring) | Word Error Rate (WER) | 2.03 | #3 |
| Speech Recognition | LibriSpeech test-other | Conv + Transformer AM (ConvLM with Transformer Rescoring) (LS only) | Word Error Rate (WER) | 5.18 | #6 |
| Speech Recognition | LibriSpeech test-other | Conv + Transformer AM + Pseudo-Labeling (ConvLM with Transformer Rescoring) | Word Error Rate (WER) | 4.11 | #3 |

Methods used in the Paper