Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis
Although Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), potentially sentiment-irrelevant and conflicting information across modalities may prevent performance from improving further. To alleviate this, we present the Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can derive a complementary joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and extensive ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.
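To make the idea of language-guided hyper-modality learning concrete, below is a minimal PyTorch sketch of one plausible AHL-style layer: language features condition the queries of cross-attention over audio and visual features, so that sentiment-irrelevant or conflicting content receives low attention weight. The module name, dimensions, and exact attention layout are assumptions for illustration, not the authors' reference implementation.

```python
# Illustrative sketch of language-guided hyper-modality learning (assumed layout).
import torch
import torch.nn as nn


class AdaptiveHyperModalityLayer(nn.Module):
    """One hypothetical AHL-style layer: language features guide how much
    audio/visual information flows into the hyper-modality representation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention from the (language-conditioned) hyper-modality tokens
        # to the audio and visual feature sequences.
        self.attn_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hyper, language, audio, visual):
        # Queries are conditioned on language so that irrelevant/conflicting
        # audio and visual content is down-weighted by the attention scores.
        query = hyper + language
        from_audio, _ = self.attn_audio(query, audio, audio)
        from_visual, _ = self.attn_visual(query, visual, visual)
        # Residual update of the hyper-modality representation.
        return hyper + from_audio + from_visual


# Toy usage: batch of 2 samples, 20 time steps, feature dim 128.
dim = 128
layer = AdaptiveHyperModalityLayer(dim)
hyper = torch.zeros(2, 20, dim)      # learnable hyper-modality tokens in practice
language = torch.randn(2, 20, dim)   # language features at one scale
audio = torch.randn(2, 20, dim)
visual = torch.randn(2, 20, dim)
out = layer(hyper, language, audio, visual)
print(out.shape)  # torch.Size([2, 20, 128])
```

In the full model, such a layer would be stacked so that language features at different scales repeatedly refine the hyper-modality representation before multimodal fusion; the stacking depth and fusion head shown here are left out for brevity.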
Results from the Paper
Ranked #1 on Multimodal Sentiment Analysis on CMU-MOSEI (Acc-7 metric)
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | F1 | 81.57 | #2 |
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | MAE | 0.404 | #2 |
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | Corr | 0.619 | #1 |
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | Acc-5 | 45.73 | #1 |
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | Acc-3 | 68.93 | #1 |
| Multimodal Sentiment Analysis | CH-SIMS | ALMT | Acc-2 | 81.19 | #1 |
| Multimodal Sentiment Analysis | CMU-MOSEI | ALMT | MAE | 0.526 | #3 |
| Multimodal Sentiment Analysis | CMU-MOSEI | ALMT | Acc-7 | 54.28 | #1 |
| Multimodal Sentiment Analysis | CMU-MOSEI | ALMT | Acc-5 | 55.96 | #1 |
| Multimodal Sentiment Analysis | CMU-MOSEI | ALMT | Corr | 0.779 | #1 |
| Multimodal Sentiment Analysis | CMU-MOSI | ALMT | MAE | 0.683 | #2 |
| Multimodal Sentiment Analysis | CMU-MOSI | ALMT | Corr | 0.805 | #3 |
| Multimodal Sentiment Analysis | CMU-MOSI | ALMT | Acc-7 | 49.42 | #1 |
| Multimodal Sentiment Analysis | CMU-MOSI | ALMT | Acc-5 | 56.41 | #1 |