We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
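The abstract names Batch Dispatch as the technique behind low-latency GPU serving but gives no implementation details on this page. The sketch below illustrates the general idea of dynamic batching, grouping requests that arrive close together into a single batched forward pass; it is not the paper's code, and names such as `run_model`, `MAX_BATCH`, and `MAX_WAIT_S` are hypothetical placeholders.

```python
# Minimal sketch of a Batch Dispatch-style serving loop (assumed structure,
# not the system described in the paper). Requests arriving within a short
# window are grouped so the GPU handles them in one batched forward pass.
import queue
import threading
import time

MAX_BATCH = 8        # largest batch sent to the GPU at once (assumed value)
MAX_WAIT_S = 0.01    # cap on extra latency spent waiting for more requests

request_q = queue.Queue()  # holds (utterance_features, reply_queue) pairs


def run_model(batch_features):
    # Placeholder for a batched acoustic-model forward pass plus decoding.
    return ["<transcript>" for _ in batch_features]


def dispatcher():
    while True:
        first = request_q.get()                 # block until one request exists
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        features = [item[0] for item in batch]
        for (_, reply_q), transcript in zip(batch, run_model(features)):
            reply_q.put(transcript)             # return result to the caller


threading.Thread(target=dispatcher, daemon=True).start()

# A caller submits its features together with a private reply queue and blocks:
reply = queue.Queue()
request_q.put(("features-for-one-utterance", reply))
print(reply.get())
```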

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Noisy Speech Recognition | CHiME clean | Deep Speech 2 | Percentage error | 3.34 | #1 |
| Noisy Speech Recognition | CHiME real | Deep Speech 2 | Percentage error | 21.79 | #4 |
| Speech Recognition | LibriSpeech test-clean | Deep Speech 2 | Word Error Rate (WER) | 5.33 | #48 |
| Speech Recognition | LibriSpeech test-other | Deep Speech 2 | Word Error Rate (WER) | 13.25 | #44 |
| Accented Speech Recognition | VoxForge American-Canadian | Deep Speech 2 | Percentage error | 7.55 | #1 |
| Accented Speech Recognition | VoxForge Commonwealth | Deep Speech 2 | Percentage error | 13.56 | #1 |
| Accented Speech Recognition | VoxForge European | Deep Speech 2 | Percentage error | 17.55 | #1 |
| Accented Speech Recognition | VoxForge Indian | Deep Speech 2 | Percentage error | 22.44 | #1 |
| Speech Recognition | WSJ eval92 | Deep Speech 2 | Word Error Rate (WER) | 3.60 | #11 |
| Speech Recognition | WSJ eval93 | Deep Speech 2 | Word Error Rate (WER) | 4.98 | #1 |
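Several rows above report Word Error Rate (WER), the word-level edit distance between the hypothesis and the reference transcript, normalized by the reference length. A minimal reference implementation of that metric (not the paper's evaluation code) might look like:

```python
# Word Error Rate: Levenshtein distance over words, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```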

Methods


No methods listed for this paper.