M2-Mixer: A Multimodal Mixer with Multi-head Loss for Classification from Multimodal Data

In this paper, we propose M2-Mixer, an MLP-Mixer based architecture with a multi-head loss for multimodal classification. It achieves better performance than convolutional, recurrent, or neural architecture search based baseline models, with the main advantage of conceptual and computational simplicity. The proposed multi-head loss function addresses the problem of modality predominance (i.e., when the training algorithm favors one modality over the others). Our experiments demonstrate that our multimodal mixer architecture, combined with the multi-head loss function, outperforms the baseline models on two benchmark multimodal datasets, AVMNIST and MIMIC-III, with on average +0.43% higher accuracy and a 6.4-fold reduction in training time on the former, and +0.33% higher accuracy and a 13.3-fold reduction in training time on the latter, compared with the previous best-performing models.
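The multi-head loss idea can be sketched as a weighted sum of a fusion-head loss and auxiliary per-modality head losses, so that each modality's branch receives its own gradient signal instead of being dominated by the strongest modality. The following is a minimal NumPy sketch under that assumption; the function names, the cross-entropy choice, and the weighting scheme are illustrative, not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single example computed from raw logits."""
    z = logits - logits.max()                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multi_head_loss(fusion_logits, modality_logits, label, weights):
    """Illustrative multi-head loss: fusion-head loss plus weighted
    per-modality head losses.

    Keeping a classification head per modality ensures every modality's
    encoder is trained directly, mitigating modality predominance.
    """
    loss = cross_entropy(fusion_logits, label)
    for logits, w in zip(modality_logits, weights):
        loss += w * cross_entropy(logits, label)
    return loss

# Toy example: two modalities (e.g. image and audio in AVMNIST), 10 classes.
rng = np.random.default_rng(0)
fusion = rng.normal(size=10)
heads = [rng.normal(size=10), rng.normal(size=10)]
total = multi_head_loss(fusion, heads, label=3, weights=[0.5, 0.5])
```

The auxiliary weights trade off how strongly each modality's own head contributes relative to the fused prediction; setting them to zero recovers a single-head (fusion-only) loss.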
