Vietnamese end-to-end speech recognition using wav2vec 2.0
Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using speech audio sampled at 16 kHz. We use the wav2vec2 architecture for the pre-trained model. In the fine-tuning phase, wav2vec2 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems that is used mainly in automatic speech recognition and handwriting recognition. On the VIVOS dataset, we achieved a WER of 6.15.
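
To make the role of CTC more concrete, the following is a minimal sketch of the greedy collapse rule that turns per-frame CTC predictions into an output sequence: consecutive repeats are merged, then blank symbols are dropped. It is an illustration only, not the authors' decoding code; the function name `ctc_greedy_collapse`, the blank symbol `_`, and the example frames are all assumptions introduced here.

```python
from itertools import groupby

# Hypothetical blank symbol; real CTC vocabularies define their own blank token.
BLANK = "_"


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels into an output sequence using the CTC rule."""
    # Step 1: merge each run of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blank symbols, which only separate real labels.
    return [label for label in merged if label != blank]


# Illustrative per-frame predictions (one label per audio frame).
frames = list("_xx_ii__n_")
decoded = "".join(ctc_greedy_collapse(frames))
print(decoded)
```

The blank symbol is what allows genuinely doubled characters to survive: `aa` collapses to a single `a`, while `a_a` yields `aa`.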
Results from the Paper
Ranked #1 on Speech Recognition on VIVOS (using extra training data)
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data | Benchmark |
---|---|---|---|---|---|---|---|
Speech Recognition | Common Voice vi | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER | 11.52 | #2 | | |
Speech Recognition | VIVOS | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER | 6.15 | #1 | | |
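
The WER metric reported above is conventionally the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. The sketch below shows that standard computation; it is not the evaluation script behind these results, and the function name `word_error_rate` and the sample sentences are assumptions made for illustration. The source does not state whether the reported values are percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one of five reference words is dropped.
reference = "tôi đang học tiếng việt"
hypothesis = "tôi đang học tiếng"
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.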