Hierarchical Pre-training for Sequence Labelling in Spoken Dialog

Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call the Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles, a large corpus of spoken dialog containing over 2.3 billion tokens. We demonstrate that hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models, and we show their importance for both pre-training and fine-tuning.
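
The abstract describes a two-level architecture: a token-level transformer encodes each utterance into a vector, and a second transformer contextualizes those vectors across the dialog before a label is predicted per utterance. The paper's code is not reproduced on this page; below is a minimal PyTorch sketch of that hierarchical structure. All module names, dimensions, and the mean-pooling step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalDialogEncoder(nn.Module):
    """Sketch of a hierarchical encoder: a token-level transformer per
    utterance, then a dialog-level transformer over utterance vectors.
    Hyperparameters and pooling are assumptions for illustration."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2, n_labels=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        token_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.token_encoder = nn.TransformerEncoder(token_layer, n_layers)
        dialog_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.dialog_encoder = nn.TransformerEncoder(dialog_layer, n_layers)
        self.classifier = nn.Linear(d_model, n_labels)  # one label per utterance

    def forward(self, token_ids):
        # token_ids: (batch, n_utterances, n_tokens)
        b, u, t = token_ids.shape
        tokens = self.embed(token_ids.view(b * u, t))        # (b*u, t, d)
        token_states = self.token_encoder(tokens)            # token-level encoding
        utt_vecs = token_states.mean(dim=1).view(b, u, -1)   # pool tokens -> (b, u, d)
        dialog_states = self.dialog_encoder(utt_vecs)        # contextualize across utterances
        return self.classifier(dialog_states)                # (b, u, n_labels)

# Example: 2 dialogs, 5 utterances each, 12 tokens per utterance
model = HierarchicalDialogEncoder(vocab_size=30000)
logits = model(torch.randint(1, 30000, (2, 5, 12)))
print(logits.shape)  # torch.Size([2, 5, 10])
```

The hierarchy is what lets the same encoder serve both pre-training (objectives over tokens and utterances) and fine-tuning (sequence labelling over utterances) with relatively few parameters.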

Published in Findings of EMNLP 2020.
Results (all rows use the Pretrained Hierarchical Transformer):

Task                                 Dataset                                          Metric            Value   Global Rank
Emotion Recognition in Conversation  DailyDialog                                      Micro-F1          60.14   #6
Dialogue Act Classification          ICSI Meeting Recorder Dialog Act (MRDA) corpus   Accuracy          92.4    #1
Emotion Recognition in Conversation  IEMOCAP                                          Weighted-F1       65.37   #36
                                                                                      Accuracy          66.05   #20
Emotion Recognition in Conversation  MELD                                             Weighted-F1       61.90   #42
Emotion Recognition in Conversation  SEMAINE                                          MAE (Valence)     0.16    #2
                                                                                      MAE (Arousal)     0.16    #1
                                                                                      MAE (Expectancy)  0.16    #1
                                                                                      MAE (Power)       7.70    #2
Text Classification                  SILICONE Benchmark                               1:1 Accuracy      71.25   #1
Dialogue Act Classification          Switchboard corpus                               Accuracy          79.2    #7
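
The SILICONE benchmark itself is distributed through the Hugging Face datasets library; the sketch below shows one way to pull a single task. The dataset id "silicone" and the "swda" configuration name are assumptions based on the public release, so verify them on the Hub (the script-based loader may also require an older datasets version or trust_remote_code=True).

```python
# Sketch: load one SILICONE task (Switchboard Dialog Act) via Hugging Face datasets.
# The "silicone" id and "swda" config name are assumed from the public release.
from datasets import load_dataset

swda = load_dataset("silicone", "swda")
print(swda)                 # train/validation/test splits
print(swda["train"][0])     # one utterance with its dialog-act label
```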
