Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark

Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics, including formatting and punctuation. This creates a potential misalignment with the creative products of musicians and songwriters as well as with listeners' experiences. For example, line breaks convey information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. First, we provide a complete revision of the transcripts, geared specifically towards ALT evaluation and following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Second, we propose a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
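To illustrate why a word-only metric misses formatting, here is a minimal sketch of a conventional word error rate (WER) computation. It is not the benchmark's implementation: the `normalize` and `word_error_rate` helpers and the example lyrics are hypothetical, and the normalization (lowercasing, stripping punctuation, splitting on any whitespace) is an assumption about what a "traditional" WER pipeline typically does.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split on whitespace.

    Assumed normalization: line breaks are whitespace, so they vanish here.
    """
    text = re.sub(r"[^\w\s']", "", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Made-up example lyrics, not taken from the dataset.
reference = "I walk alone,\nUnder the rain!"
flattened = "i walk alone under the rain"   # formatting lost, words intact
wrong_word = "i walk along under the rain"  # one substituted word

print(word_error_rate(reference, flattened))
print(word_error_rate(reference, wrong_word))
```

Under these assumptions the flattened transcript scores a WER of zero even though its casing, punctuation, and line break are all gone, while a single wrong word is penalized. This is the gap that formatting-aware metrics are meant to close.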

Datasets


Introduced in the Paper:

Jam-ALT

Used in the Paper:

JamendoLyrics

Task: Automatic Lyrics Transcription. Each cell gives the metric value followed by the global rank in parentheses; "-" marks a metric not listed for that model.

| Dataset | Model | Word Error Rate (WER) | Case Error Rate | Punctuation F1 | Parenthesis F1 | Line break F1 | Section break F1 |
|---|---|---|---|---|---|---|---|
| Jam-ALT | HTDemucs + Whisper v3 | 47.9 (#5) | 3.8 (#2) | 29.0 (#4) | - | 65.7 (#4) | - |
| Jam-ALT | AudioShake | 26.0 (#1) | 3.4 (#1) | 50.5 (#1) | 29.4 (#1) | 82.3 (#1) | 72.1 (#1) |
| Jam-ALT | Whisper v2 | 35.7 (#3) | 4.5 (#4) | 41.7 (#2) | - | 69.3 (#3) | 3.3 (#2) |
| Jam-ALT | Whisper v3 | 35.5 (#2) | 4.3 (#3) | 41.6 (#3) | - | 73.5 (#2) | 1.0 (#3) |
| Jam-ALT | HTDemucs + Whisper v2 | 44.0 (#4) | 5.3 (#5) | 28.0 (#5) | - | 61.2 (#5) | - |
| Jam-ALT English | HTDemucs + Whisper v3 | 43.0 (#5) | 4.1 (#4) | 23.3 (#6) | - | 66.8 (#4) | - |
| Jam-ALT English | HTDemucs + Whisper v2 | 32.3 (#3) | 5.3 (#6) | 39.2 (#3) | - | 53.8 (#6) | - |
| Jam-ALT English | Whisper v2 | 43.8 (#6) | 3.5 (#2) | 31.3 (#5) | - | 63.0 (#5) | 11.2 (#2) |
| Jam-ALT English | LyricWhiz | 24.6 (#2) | 3.5 (#2) | 34.0 (#4) | - | 74.0 (#2) | 1.4 (#4) |
| Jam-ALT English | AudioShake | 22.1 (#1) | 3.4 (#1) | 59.0 (#1) | 32.4 (#1) | 80.7 (#1) | 77.4 (#1) |
| Jam-ALT English | Whisper v3 | 37.7 (#4) | 4.8 (#5) | 40.9 (#2) | - | 71.5 (#3) | 2.6 (#3) |
| Jam-ALT French | Whisper v2 | 27.7 (#1) | 3.2 (#2) | 45.8 (#1) | - | 73.4 (#3) | 1.4 (#2) |
| Jam-ALT French | HTDemucs + Whisper v3 | 44.9 (#5) | 3.2 (#2) | 30.9 (#5) | - | 69.4 (#4) | - |
| Jam-ALT French | Whisper v3 | 34.7 (#2) | 3.3 (#5) | 42.4 (#3) | - | 77.8 (#2) | - |
| Jam-ALT French | AudioShake | 34.9 (#3) | 2.0 (#1) | 45.8 (#1) | 41.3 (#1) | 84.9 (#1) | 72.5 (#1) |
| Jam-ALT French | HTDemucs + Whisper v2 | 43.3 (#4) | 3.2 (#2) | 34.9 (#4) | - | 66.1 (#5) | - |
| Jam-ALT German | Whisper v2 | 45.4 (#4) | 5.3 (#4) | 38.7 (#3) | - | 69.9 (#4) | - |
| Jam-ALT German | AudioShake | 24.4 (#1) | 4.1 (#2) | 48.5 (#1) | 8.1 (#1) | 81.2 (#1) | 69.2 (#1) |
| Jam-ALT German | Whisper v3 | 40.7 (#2) | 4.0 (#1) | 41.2 (#2) | - | 71.2 (#3) | 1.2 (#2) |
| Jam-ALT German | HTDemucs + Whisper v3 | 43.5 (#3) | 4.4 (#3) | 34.0 (#4) | - | 72.0 (#2) | - |
| Jam-ALT German | HTDemucs + Whisper v2 | 65.2 (#5) | 5.9 (#5) | 30.2 (#5) | - | 67.5 (#5) | - |
| Jam-ALT Spanish | Whisper v2 | 25.7 (#2) | 6.5 (#4) | 50.0 (#1) | - | 71.7 (#3) | 3.1 (#2) |
| Jam-ALT Spanish | AudioShake | 22.5 (#1) | 4.1 (#2) | 47.8 (#2) | 38.0 (#1) | 82.7 (#1) | 69.6 (#1) |
| Jam-ALT Spanish | HTDemucs + Whisper v2 | 38.8 (#4) | 7.1 (#5) | 17.2 (#5) | - | 56.4 (#4) | - |
| Jam-ALT Spanish | HTDemucs + Whisper v3 | 61.5 (#5) | 3.6 (#1) | 28.7 (#4) | - | 52.4 (#5) | - |
| Jam-ALT Spanish | Whisper v3 | 28.6 (#3) | 5.0 (#3) | 41.9 (#3) | - | 73.7 (#2) | - |
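
As a rough illustration of what an F1 score over line breaks measures, the sketch below compares the positions of line breaks in a reference and a hypothesis transcript. It is a simplification, not the benchmark's metric: it assumes both transcripts contain exactly the same words, so that word indices line up without any alignment step, and the helper names and example text are hypothetical. The result is a fraction between 0 and 1, whereas the scores listed above appear to be on a 0 to 100 scale.

```python
def line_break_positions(text: str) -> set[int]:
    """Return the word counts after which a line break occurs.

    Blank lines (section breaks) are skipped in this simplified view.
    """
    lines = [line.split() for line in text.strip().split("\n") if line.strip()]
    positions, words_seen = set(), 0
    for words in lines[:-1]:  # no break is counted after the final line
        words_seen += len(words)
        positions.add(words_seen)
    return positions


def f1_score(reference: set[int], hypothesis: set[int]) -> float:
    """Harmonic mean of precision and recall over matching break positions."""
    true_positives = len(reference & hypothesis)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(hypothesis)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up example: same words, but the second break is misplaced.
reference = "one two three\nfour five\nsix seven"
hypothesis = "one two three\nfour five six\nseven"

ref_breaks = line_break_positions(reference)  # breaks after words 3 and 5
hyp_breaks = line_break_positions(hypothesis)  # breaks after words 3 and 6
print(f1_score(ref_breaks, hyp_breaks))
```

Here one of the two predicted breaks matches the reference, so precision and recall are both one half. A real evaluation would first need to align the hypothesis to the reference, since transcription errors shift word positions.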

Methods