Hierarchical Fusion for Online Multimodal Dialog Act Classification

We propose a framework for online multimodal dialog act (DA) classification based on raw audio and ASR-generated transcriptions of current and past utterances. Existing multimodal DA classification approaches are limited by ineffective audio modeling and late-stage fusion. We demonstrate significant improvements in multimodal DA classification by integrating modalities at a more granular level and by leveraging recent advancements in large language and audio models for audio feature extraction. We further investigate the effectiveness of self-attention and cross-attention mechanisms in modeling utterances and dialogs for DA classification. We achieve a substantial increase of 3 percentage points in F1 score over current state-of-the-art models on two prominent DA classification datasets, MRDA and EMOTyDA.
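The abstract describes fusing audio and text at a granular (token/frame) level with cross-attention, followed by self-attention over the fused sequence. Below is a minimal PyTorch sketch of that kind of fusion block for utterance-level DA classification. The class name, dimensions, pooling choice, and number of classes are illustrative assumptions, not the authors' implementation; the encoders producing `text_feats` and `audio_feats` (e.g., a pretrained language model over ASR output and a pretrained audio model) are assumed to exist upstream.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Sketch: fuse per-token text features with frame-level audio
    features via cross-attention, then classify the dialog act.
    Hypothetical module, not the paper's exact architecture."""

    def __init__(self, text_dim=768, audio_dim=768, hidden=256,
                 num_classes=12, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        # Cross-attention: text tokens (queries) attend over audio frames.
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Self-attention over the fused token sequence.
        self.self_attn = nn.TransformerEncoderLayer(
            hidden, heads, dim_feedforward=4 * hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats:  (B, T_text, text_dim), e.g. LM embeddings of ASR output
        # audio_feats: (B, T_audio, audio_dim), e.g. pretrained audio encoder
        q = self.text_proj(text_feats)
        kv = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # granular, token-level fusion
        fused = self.self_attn(fused + q)      # residual + utterance self-attention
        utterance = fused.mean(dim=1)          # pool tokens to one utterance vector
        return self.classifier(utterance)


# Example usage with dummy features (shapes are assumptions):
model = CrossModalFusion()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 100, 768))
print(logits.shape)  # torch.Size([2, 12])
```

In this sketch, fusion happens before pooling, so each transcript token can attend to the audio frames it aligns with; a late-fusion baseline would instead concatenate one pooled vector per modality, which is the limitation the abstract argues against.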


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Dialogue Act Classification | EMOTyDA | Hierarchical Fusion | Accuracy | 63.42 | #1 |
| Dialogue Act Classification | ICSI Meeting Recorder Dialog Act (MRDA) corpus | Hierarchical Fusion | Accuracy | 91.8 | #2 |
