Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion

Emotion recognition is a challenging research area given its complex nature: humans express emotional cues across multiple modalities such as language, facial expressions, and speech. Representation and fusion of features are the most crucial tasks in multimodal emotion recognition research. Self-Supervised Learning (SSL) has become a prominent and influential direction in representation learning, and researchers now have access to pre-trained SSL models for different data modalities. In this paper, for the first time in the literature, we represent the three input modalities of text, audio (speech), and vision with features extracted from independently pre-trained SSL models. Given the high-dimensional nature of SSL features, we introduce a novel Transformer- and attention-based fusion mechanism that combines these multimodal SSL features and achieves state-of-the-art results for multimodal emotion recognition. We benchmark and evaluate our model to show that it is robust and outperforms state-of-the-art models on four datasets.
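To make the fusion idea concrete, the sketch below shows one plausible way to combine pre-extracted SSL features from text, audio, and vision with a Transformer encoder followed by attention-based pooling. This is a minimal illustration under assumed settings, not the authors' exact architecture: the feature dimensions, layer counts, modality-type embeddings, and class count are all illustrative choices, and the specific SSL encoders (e.g. a BERT-style text model, a wav2vec-style speech model, a visual encoder) are assumptions rather than details taken from the abstract.

```python
# Minimal sketch of Transformer + attention fusion over pre-extracted SSL features.
# All hyperparameters and the pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn


class SSLFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, vision_dim=512,
                 d_model=256, num_layers=2, num_heads=4, num_classes=7):
        super().__init__()
        # Project each modality's SSL features into a shared embedding space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, d_model),
            "audio": nn.Linear(audio_dim, d_model),
            "vision": nn.Linear(vision_dim, d_model),
        })
        # Learned modality-type embeddings so the encoder can distinguish tokens.
        self.modality_embed = nn.Embedding(3, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Attention pooling: score each fused token, then take a weighted sum.
        self.attn_pool = nn.Linear(d_model, 1)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_feats, audio_feats, vision_feats):
        # Each input: (batch, seq_len_modality, dim_modality) of frozen SSL features.
        tokens = []
        for idx, (name, feats) in enumerate(
                [("text", text_feats), ("audio", audio_feats), ("vision", vision_feats)]):
            x = self.proj[name](feats)
            modality_ids = torch.full(
                x.shape[:2], idx, dtype=torch.long, device=x.device)
            tokens.append(x + self.modality_embed(modality_ids))
        fused = self.encoder(torch.cat(tokens, dim=1))          # (B, T_total, d_model)
        weights = torch.softmax(self.attn_pool(fused), dim=1)   # (B, T_total, 1)
        pooled = (weights * fused).sum(dim=1)                   # (B, d_model)
        return self.classifier(pooled)


# Usage with dummy SSL feature sequences for the three modalities.
model = SSLFusionClassifier()
logits = model(torch.randn(2, 20, 768),   # text token features
               torch.randn(2, 50, 768),   # audio frame features
               torch.randn(2, 30, 512))   # video frame features
print(logits.shape)  # torch.Size([2, 7])
```

Concatenating the projected modality sequences into a single token stream lets self-attention model both within- and cross-modal interactions, while the attention pooling step produces a fixed-size representation for classification; this is one common design for fusing variable-length multimodal features.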
