Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos

Emotion recognition in conversations is crucial for the development of empathetic machines. Present methods mostly ignore the role of inter-speaker dependency relations while classifying emotions in conversations. In this paper, we address recognizing utterance-level emotions in dyadic conversational videos. We propose a deep neural framework, termed Conversational Memory Network (CMN), which leverages contextual information from the conversation history. In particular, CMN uses a multimodal approach comprising audio, visual, and textual features, with gated recurrent units to model the past utterances of each speaker into memories. These memories are then merged using attention-based hops to capture inter-speaker dependencies. Experiments show a significant improvement of 3-4% in accuracy over the state of the art.
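
Below is a minimal sketch of the architecture described in the abstract: per-speaker GRUs turn each speaker's utterance history into memories, and the current utterance queries both memory banks through several attention hops before classification. It assumes pre-extracted, fused utterance features; all names, dimensions, and the fusion details are illustrative assumptions, not the authors' implementation.

# Hedged sketch of a CMN-style model (assumed PyTorch implementation, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMNSketch(nn.Module):
    def __init__(self, feat_dim=100, mem_dim=100, num_hops=3, num_classes=6):
        super().__init__()
        # One GRU per speaker turns that speaker's utterance history into memory cells.
        self.gru_a = nn.GRU(feat_dim, mem_dim, batch_first=True)
        self.gru_b = nn.GRU(feat_dim, mem_dim, batch_first=True)
        self.query_proj = nn.Linear(feat_dim, mem_dim)
        self.num_hops = num_hops
        self.classifier = nn.Linear(mem_dim, num_classes)

    def _attend(self, query, memories):
        # Attention read: softmax over memory cells, weighted sum added back to the query.
        scores = torch.bmm(memories, query.unsqueeze(2)).squeeze(2)   # (B, T)
        weights = F.softmax(scores, dim=1)
        read = torch.bmm(weights.unsqueeze(1), memories).squeeze(1)   # (B, mem_dim)
        return query + read

    def forward(self, utterance, history_a, history_b):
        # utterance:  (B, feat_dim)        current utterance to classify
        # history_a:  (B, T_a, feat_dim)   speaker A's preceding utterances
        # history_b:  (B, T_b, feat_dim)   speaker B's preceding utterances
        mem_a, _ = self.gru_a(history_a)   # (B, T_a, mem_dim)
        mem_b, _ = self.gru_b(history_b)   # (B, T_b, mem_dim)
        q = self.query_proj(utterance)
        # Multiple attention hops over both speakers' memories capture
        # inter-speaker dependencies before the final classification.
        for _ in range(self.num_hops):
            q = self._attend(q, mem_a)
            q = self._attend(q, mem_b)
        return self.classifier(q)


if __name__ == "__main__":
    model = CMNSketch()
    logits = model(torch.randn(4, 100), torch.randn(4, 5, 100), torch.randn(4, 5, 100))
    print(logits.shape)  # torch.Size([4, 6])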


Datasets

IEMOCAP

Task                                 Dataset   Model   Metric       Value   Global Rank
Emotion Recognition in Conversation  IEMOCAP   CMN     Weighted-F1  56.19   #45
Emotion Recognition in Conversation  IEMOCAP   CMN     Accuracy     56.32   #25
Emotion Recognition in Conversation  IEMOCAP   CMN     Macro-F1     54.84   #5
