Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition

ICASSP 2020 · Bagus Tris Atmaja, Masato Akagi

Due to its ability to accurately predict emotional states from multimodal features, audiovisual emotion recognition has recently gained increasing interest from researchers. This paper proposes two methods to predict emotional attributes from audio and visual data, using multitask learning and a fusion strategy. First, multitask learning is employed by adjusting three weighting parameters, one per attribute, to improve the recognition rate. Second, a multistage fusion is proposed to combine the final predictions from the different modalities. Our multitask learning approach, applied to unimodal and early fusion methods, shows an improvement over single-task learning, with an average CCC score of 0.431 compared to 0.297. The multistage method, applied to the late fusion approach, significantly improved the agreement score between true and predicted values on the development set (from [0.537, 0.565, 0.083] to [0.68, 0.656, 0.443]) for arousal, valence, and liking.
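The abstract describes a multitask objective that weights one loss term per emotional attribute (arousal, valence, liking). Below is a minimal NumPy sketch of one common way to realize this: a weighted sum of (1 − CCC) terms, where CCC is the concordance correlation coefficient. The weight names alpha, beta, gamma, their default values, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def multitask_ccc_loss(y_true, y_pred, alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted sum of (1 - CCC) over the three attributes.

    y_true, y_pred: arrays of shape (n_samples, 3), with columns
    ordered [arousal, valence, liking]. alpha, beta, gamma are the
    per-attribute weights tuned in the multitask setup (names and
    default values here are hypothetical).
    """
    losses = [1.0 - ccc(y_true[:, i], y_pred[:, i]) for i in range(3)]
    return alpha * losses[0] + beta * losses[1] + gamma * losses[2]
```

Minimizing 1 − CCC trains the model directly against the metric used for evaluation, which is the usual motivation for CCC-based losses in dimensional emotion recognition.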
