W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

7 Aug 2021 · Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu

Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM: the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, while the latter trains the model to learn contextualized speech representations by solving a masked prediction task over the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized end-to-end by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, compared to published models such as Conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal Conformer-based wav2vec 2.0 by more than 30% relative.
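
To make the joint objective concrete, the PyTorch sketch below illustrates how a contrastive task and a masked-prediction task can share one model and be optimized together, as the abstract describes. It is a minimal sketch under assumptions of my own, not the paper's implementation: plain Transformer layers stand in for the Conformer blocks, a simple codebook lookup stands in for the Gumbel-softmax quantizer, in-utterance frames serve as contrastive distractors, and the class name `W2vBertSketch`, dimensions, temperature, and equal loss weighting are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class W2vBertSketch(nn.Module):
    """Joint contrastive + masked-prediction pre-training, heavily simplified."""

    def __init__(self, feat_dim=80, d_model=256, codebook_size=320,
                 n_contrastive_layers=2, n_mlm_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.mask_emb = nn.Parameter(torch.randn(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Lower stack: yields context vectors used by the contrastive task.
        self.contrastive_encoder = nn.TransformerEncoder(layer, n_contrastive_layers)
        # Upper stack: consumes the lower stack's output for masked prediction.
        self.mlm_encoder = nn.TransformerEncoder(layer, n_mlm_layers)
        # Codebook that discretizes unmasked features into speech token ids
        # (a stand-in for the paper's quantization module).
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_model))
        self.mlm_head = nn.Linear(d_model, codebook_size)

    def forward(self, feats, mask):
        # feats: (B, T, feat_dim) acoustic features; mask: (B, T) bool, True where masked.
        B, T, _ = feats.shape
        x = self.proj(feats)
        targets = x.detach()  # quantization targets come from the unmasked features
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand(B, T, -1), x)

        c = self.contrastive_encoder(x)  # contextualized vectors from the lower stack

        # Discretize each frame to its nearest codebook entry (by dot product).
        token_ids = torch.einsum("btd,vd->btv", targets, self.codebook).argmax(-1)
        quantized = self.codebook[token_ids]  # (B, T, d_model)

        # Contrastive task: pick out the true quantized frame among distractors
        # (here, all other frames of the same utterance); 0.1 is a temperature.
        sim = F.cosine_similarity(c.unsqueeze(2), quantized.unsqueeze(1), dim=-1) / 0.1
        pos_idx = torch.arange(T, device=feats.device).expand(B, T)
        contrastive_loss = F.cross_entropy(sim[mask], pos_idx[mask])

        # MLM task: the upper stack predicts the token id of each masked frame.
        h = self.mlm_encoder(c)
        mlm_loss = F.cross_entropy(self.mlm_head(h)[mask], token_ids[mask])

        # The two self-supervised losses are optimized jointly, end to end.
        return contrastive_loss + mlm_loss


# Example: one pre-training step on random features.
model = W2vBertSketch()
feats, mask = torch.randn(4, 100, 80), torch.rand(4, 100) < 0.3
loss = model(feats, mask)
loss.backward()
```

The key structural point the sketch tries to capture is that the masked-prediction targets are produced by the same network that is being trained for the contrastive task, so no separate clustering or pre-trained quantizer stage is required.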


Results from the Paper


Ranked #1 on Speech Recognition on LibriSpeech test-clean (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|------|---------|-------|-------------|--------------|-------------|--------------------------|
| Speech Recognition | LibriSpeech test-clean | w2v-BERT XXL | Word Error Rate (WER) | 1.4 | #1 | Yes |
| Speech Recognition | LibriSpeech test-other | w2v-BERT XXL | Word Error Rate (WER) | 2.5 | #2 | Yes |
