TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Speech Recognition	Europarl-ASR EN Guest-test	mllp_2021_streaming_verb	WER	7.3	# 2
Speech Recognition	Europarl-ASR EN Guest-test	mllp_2021_offline_verb	WER	7.0	# 1
Speech Recognition	Europarl-ASR EN MEP-test	mllp_2021_streaming_filt	WER	7.9	# 2
Speech Recognition	Europarl-ASR EN MEP-test	mllp_2021_offline_filt	WER	7.8	# 1
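
The WER column above is word error rate, conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recogniser output, divided by the number of reference words and expressed as a percentage. A minimal sketch of that standard definition follows. The function name `word_error_rate` and the whitespace tokenisation are assumptions for illustration; the scoring tool and text normalisation behind the reported figures are not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using the standard edit-distance definition.

    Illustrative only: tokens are split on whitespace with no normalisation,
    which may differ from the scoring setup behind the reported results.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of five reference words
    print(word_error_rate("the house is in session", "the house in session"))
```

Lower is better: a WER of 7.0 corresponds to roughly seven word errors per hundred reference words.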

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/europarl-asr-a-large-corpus-of-parliamentary/speech-recognition-on-europarl-asr-en-guest)](https://paperswithcode.com/sota/speech-recognition-on-europarl-asr-en-guest?p=europarl-asr-a-large-corpus-of-parliamentary)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/europarl-asr-a-large-corpus-of-parliamentary/speech-recognition-on-europarl-asr-en-mep)](https://paperswithcode.com/sota/speech-recognition-on-europarl-asr-en-mep?p=europarl-asr-a-large-corpus-of-parliamentary)`

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Interspeech 2021 · Gonçal V. Garcés Díaz-Munío, Joan-Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez-González-de-Martos, Jorge Civera, Albert Sanchis, Alfons Juan

We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.

Code

No code implementations yet.

Tasks

Benchmarking

Data Augmentation

Speech Recognition

Datasets

Introduced in the Paper:

Europarl-ASR

Used in the Paper:

Europarl

Europarl-ST

Results from the Paper

Ranked #1 on Speech Recognition on Europarl-ASR EN MEP-test

Methods

No methods listed for this paper.
