Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

9 Oct 2023 · Haoyu Zhang, Yu Wang, Guanghao Yin, Kejun Liu, Yuanyuan Liu, Tianshu Yu

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present the Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can then derive a complementary joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS), and extensive ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.
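To make the architecture described above concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation: language features act as attention queries over audio and visual features, so that sentiment-irrelevant or conflicting audio/visual content receives low attention weight, and the accumulated hyper-modality representation is then fused with the language features. Layer counts, hidden sizes, the simple additive update, and the feature dimensions (e.g., BERT/COVAREP/Facet-sized inputs) are illustrative assumptions; the paper additionally uses language features at multiple scales to guide the AHL module.

```python
import torch
import torch.nn as nn

class AdaptiveHyperModalityLearning(nn.Module):
    """Sketch of one language-guided layer: language tokens are the queries,
    audio/visual tokens are the keys/values, so uninformative audio/visual
    content is down-weighted by the attention."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lang, audio, visual, hyper=None):
        # Language-guided cross-attention over each non-language modality.
        a, _ = self.attn_audio(lang, audio, audio)
        v, _ = self.attn_visual(lang, visual, visual)
        if hyper is None:
            hyper = torch.zeros_like(lang)
        # Accumulate the hyper-modality representation across layers.
        return self.norm(hyper + a + v)


class ALMTSketch(nn.Module):
    """Minimal end-to-end sketch (assumed structure): project each modality to a
    shared dimension, refine the hyper-modality representation under language
    guidance, fuse it with language via a transformer encoder, and regress a
    sentiment score."""

    def __init__(self, d_lang=768, d_audio=74, d_visual=35, dim=128, layers=3):
        super().__init__()
        self.proj_l = nn.Linear(d_lang, dim)
        self.proj_a = nn.Linear(d_audio, dim)
        self.proj_v = nn.Linear(d_visual, dim)
        self.ahl_layers = nn.ModuleList(
            [AdaptiveHyperModalityLearning(dim) for _ in range(layers)]
        )
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, 1)  # regression head, e.g. scores in [-3, 3] on MOSI

    def forward(self, lang, audio, visual):
        l, a, v = self.proj_l(lang), self.proj_a(audio), self.proj_v(visual)
        hyper = None
        for layer in self.ahl_layers:
            hyper = layer(l, a, v, hyper)
        fused = self.fusion(torch.cat([l, hyper], dim=1))
        return self.head(fused.mean(dim=1)).squeeze(-1)


if __name__ == "__main__":
    model = ALMTSketch()
    lang = torch.randn(2, 50, 768)   # e.g. BERT token features
    audio = torch.randn(2, 50, 74)   # e.g. COVAREP features
    visual = torch.randn(2, 50, 35)  # e.g. Facet features
    print(model(lang, audio, visual).shape)  # torch.Size([2])
```

The key design choice this sketch illustrates is that fusion never attends directly between audio and visual streams; everything is routed through language-guided attention, which is what suppresses irrelevant or conflicting cross-modal signals.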


Results from the Paper


Task: Multimodal Sentiment Analysis · Model: ALMT

Dataset     Metric   Value   Global Rank
CH-SIMS     F1       81.57   #2
CH-SIMS     MAE      0.404   #2
CH-SIMS     Corr     0.619   #1
CH-SIMS     Acc-5    45.73   #1
CH-SIMS     Acc-3    68.93   #1
CH-SIMS     Acc-2    81.19   #1
CMU-MOSEI   MAE      0.526   #3
CMU-MOSEI   Acc-7    54.28   #1
CMU-MOSEI   Acc-5    55.96   #1
CMU-MOSEI   Corr     0.779   #1
CMU-MOSI    MAE      0.683   #2
CMU-MOSI    Corr     0.805   #3
CMU-MOSI    Acc-7    49.42   #1
CMU-MOSI    Acc-5    56.41   #1
