Speech recognition is the task of transcribing the speech contained in an audio signal into text.
Speech translation has recently become an increasingly popular topic of research, partly due to the development of benchmark datasets.
Speech recognition is a well-developed research field, and current state-of-the-art systems are used in many applications across the software industry; yet, to date, no comparably robust system exists for recognizing words and sentences from the singing voice.
In our experiments, we show that through alteration along different dimensions, the model learns to encode distinct aspects of speech.
The effectiveness of recurrent neural networks depends largely on their ability to store, in their dynamical memory, information extracted from input sequences at different frequencies and timescales.
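The idea of units operating at different timescales can be illustrated with a minimal sketch (not the paper's model): a leaky-integrator RNN whose units mix their previous state with the new candidate activation at different per-unit rates. All weights and the `alpha` values below are illustrative assumptions.

```python
import math

def leaky_rnn_step(state, x, alphas, w_in, w_rec):
    """One step for a vector of leaky units; alpha near 1 -> slow unit,
    alpha near 0 -> fast unit (all parameters here are made up)."""
    new_state = []
    for s, a, wi, wr in zip(state, alphas, w_in, w_rec):
        h = math.tanh(wi * x + wr * s)      # candidate activation
        new_state.append(a * s + (1.0 - a) * h)  # leaky update
    return new_state

# Feed a single impulse, then silence: the slow unit (alpha=0.95)
# retains a trace of the input far longer than the fast unit (alpha=0.1).
state = [0.0, 0.0]
for x in [1.0] + [0.0] * 20:
    state = leaky_rnn_step(state, x,
                           alphas=[0.95, 0.1],
                           w_in=[1.0, 1.0],
                           w_rec=[0.5, 0.5])
```

After the 21 steps, the slow unit's state is orders of magnitude larger than the fast unit's, which is the sense in which different units capture different timescales.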
Sequence-to-sequence models, in particular the Transformer, achieve state-of-the-art results in automatic speech recognition.
We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs.
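The Siamese idea above can be sketched in a few lines: two sequences are embedded by the *same* shared-weight encoder, and the embeddings are compared with cosine similarity, so identical words score higher than different ones. This is a toy illustration with hand-picked scalar weights, not the paper's multilingual RNN.

```python
import math

def rnn_encode(seq):
    """Shared-weight toy encoder: two hidden units (fixed, illustrative
    weights) fold a sequence of scalar features into a 2-d embedding."""
    params = [(0.8, 0.5), (0.3, -0.7)]   # (w_in, w_rec) per unit, made up
    h = [0.0, 0.0]
    for x in seq:
        h = [math.tanh(wi * x + wr * hi) for (wi, wr), hi in zip(params, h)]
    return h

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-9)

def siamese_score(seq_a, seq_b):
    # Both branches use rnn_encode -- that weight sharing is what makes
    # the pair of encoders "Siamese".
    return cosine(rnn_encode(seq_a), rnn_encode(seq_b))

same = siamese_score([1.0, 0.2, -0.5], [1.0, 0.2, -0.5])
diff = siamese_score([1.0, 0.2, -0.5], [-1.0, 0.4, 0.9])
```

In training, a contrastive loss would push `same`-type scores up and `diff`-type scores down; here the untrained encoder already maps identical inputs to identical embeddings, so `same` is 1 by construction.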
We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.