Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

8 Dec 2015
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages...

Results from the Paper

Ranked #1 on Speech Recognition on WSJ eval93 (using extra training data)

TASK                         DATASET                     MODEL          METRIC NAME            METRIC VALUE  GLOBAL RANK
Noisy Speech Recognition     CHiME clean                 Deep Speech 2  Percentage error       3.34          # 1
Noisy Speech Recognition     CHiME real                  Deep Speech 2  Percentage error       21.79         # 3
Speech Recognition           LibriSpeech test-clean      Deep Speech 2  Word Error Rate (WER)  5.33          # 20
Speech Recognition           LibriSpeech test-other      Deep Speech 2  Word Error Rate (WER)  13.25         # 16
Accented Speech Recognition  VoxForge American-Canadian  Deep Speech 2  Percentage error       7.55          # 1
Accented Speech Recognition  VoxForge Commonwealth       Deep Speech 2  Percentage error       13.56         # 1
Accented Speech Recognition  VoxForge European           Deep Speech 2  Percentage error       17.55         # 1
Accented Speech Recognition  VoxForge Indian             Deep Speech 2  Percentage error       22.44         # 1
Speech Recognition           WSJ eval92                  Deep Speech 2  Percentage error       3.60          # 4
Speech Recognition           WSJ eval93                  Deep Speech 2  Percentage error       4.98          # 1
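
The LibriSpeech rows above report word error rate (WER). As a rough illustration of how that metric is conventionally defined (word-level edit distance divided by the number of reference words), here is a minimal sketch. It is not the scoring script used by the paper or by this leaderboard; the function name `word_error_rate` is hypothetical, and the whitespace tokenization with no text normalization is an assumption made for brevity.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real scoring tools usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"{100 * wer:.2f}")  # one deletion over six reference words: 16.67
```

Leaderboards typically quote the value as a percentage, so a table entry such as 5.33 would correspond to a return value of about 0.0533 from a function like this one.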