Vietnamese end-to-end speech recognition using wav2vec 2.0

Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using speech audio sampled at 16 kHz. We use the wav2vec2 architecture for the pre-trained model. In the fine-tuning phase, wav2vec2 is fine-tuned using Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in automatic speech recognition and handwriting recognition. On the VIVOS dataset, we achieved a WER of 6.15.
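To illustrate how CTC relates frame-level outputs to a transcript, here is a minimal sketch of the standard CTC collapsing rule applied greedily: merge consecutive repeated labels, then drop the blank symbol. This is an illustration only, not the authors' decoder. The blank symbol `_`, the function name `ctc_greedy_collapse`, and the example frames are assumptions introduced for the example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real vocabularies define their own


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule to a sequence of per-frame labels.

    Step 1: merge runs of identical consecutive labels into one label.
    Step 2: remove every blank symbol from the merged sequence.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


if __name__ == "__main__":
    # Per-frame argmax labels for a made-up utterance.
    frames = list("xx_i_nn")
    print(ctc_greedy_collapse(frames))
```

A blank between two identical labels keeps them separate, so `a_a` collapses to `aa` while `aa` collapses to `a`. This is what lets a CTC model emit doubled characters.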


Results from the Paper


 Ranked #1 on Speech Recognition on VIVOS (using extra training data)

Task                Dataset          Model                                                                  Metric Name  Metric Value  Global Rank
Speech Recognition  Common Voice vi  Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI  Test WER     11.52         #2
Speech Recognition  VIVOS            Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI  Test WER     6.15          #1
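The metric reported above is word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below shows one common way to compute it, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` and the example sentences are hypothetical, and the evaluation script behind the reported scores may differ.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER of `hypothesis` against `reference` as a fraction.

    Uses the Levenshtein distance over whitespace-separated words
    (substitutions, insertions, and deletions each cost 1).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("xin chào các bạn", "xin chào bạn"))
```

Leaderboard values are usually this fraction multiplied by 100, so a fraction of 0.25 would be listed as 25.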
