M2-Mixer: A Multimodal Mixer with Multi-head Loss for Classification from Multimodal Data

In this paper, we propose M2-Mixer, an MLP-Mixer based architecture with a multi-head loss for multimodal classification. It achieves better performance than convolutional, recurrent, or neural architecture search based baseline models, with the main advantage of conceptual and computational simplicity. The proposed multi-head loss function addresses the problem of modality predominance (i.e., when the training algorithm favors one modality over the others). Our experiments demonstrate that our multimodal mixer architecture, combined with the multi-head loss function, outperforms the baseline models on two benchmark multimodal datasets, AVMNIST and MIMIC-III, with on average +0.43% higher accuracy and a 6.4-fold reduction in training time on the former, and +0.33% higher accuracy and a 13.3-fold reduction in training time on the latter, compared with the previous best-performing models.
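The multi-head loss idea can be sketched as a weighted sum of a fusion-head loss and auxiliary per-modality head losses, so that each modality's branch receives its own gradient signal instead of being dominated by the strongest modality. The following is a minimal NumPy sketch under that assumption; the function names, the cross-entropy choice, and the weighting scheme are illustrative, not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single example computed from raw logits."""
    z = logits - logits.max()                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multi_head_loss(fusion_logits, modality_logits, label, weights):
    """Illustrative multi-head loss: fusion-head loss plus weighted
    per-modality head losses.

    Keeping a classification head per modality ensures every modality's
    encoder is trained directly, mitigating modality predominance.
    """
    loss = cross_entropy(fusion_logits, label)
    for logits, w in zip(modality_logits, weights):
        loss += w * cross_entropy(logits, label)
    return loss

# Toy example: two modalities (e.g. image and audio in AVMNIST), 10 classes.
rng = np.random.default_rng(0)
fusion = rng.normal(size=10)
heads = [rng.normal(size=10), rng.normal(size=10)]
total = multi_head_loss(fusion, heads, label=3, weights=[0.5, 0.5])
```

The auxiliary weights trade off how strongly each modality's own head contributes relative to the fused prediction; setting them to zero recovers a single-head (fusion-only) loss.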
