Multimodal Deep Learning

66 papers with code • 1 benchmarks • 17 datasets

Multimodal deep learning is a type of deep learning that combines information from multiple modalities, such as text, image, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data that includes multiple types of information and using the network to make predictions based on this combined data.

One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done using a variety of techniques, such as fusing the features extracted from each modality, or using attention mechanisms to weight the contribution of each modality based on its importance for the task at hand.

Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous vehicles. By combining information from multiple modalities, multimodal deep learning can improve the accuracy and robustness of models, enabling them to perform better in real-world scenarios where multiple types of information are present.

Benchmarks

Add a Result

These leaderboards are used to track progress in Multimodal Deep Learning

Trend	Dataset	Best Model	Paper	Code	Compare
	CUB-200-2011	Two Branch Network (Text - Bert + Image - Nts-Net)			See all

Datasets

Subtasks

Multimodal Text and Image Classification

Latest papers

Most implemented Social Latest No code

Formalizing Multimedia Recommendation through Multimodal Deep Learning

sisinflab/formal-multimod-rec • • 11 Sep 2023

Recommender systems (RSs) offer personalized navigation experiences on online platforms, but recommendation remains a challenging task, particularly in specific scenarios and domains.

11 Sep 2023

Paper
Code

Multimodal Foundation Models For Echocardiogram Interpretation

echonet/echo_CLIP • • 29 Aug 2023

Multimodal deep learning foundation models can learn the relationship between images and text.

29 Aug 2023

Paper
Code

On the Adversarial Robustness of Multi-Modal Foundation Models

chs20/robustvlm • • 21 Aug 2023

In this paper we show that imperceivable attacks on images in order to change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users e. g. by guiding them to malicious websites or broadcast fake information.

21 Aug 2023

Paper
Code

MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning

pliang279/MultiBench • • 28 Jun 2023

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data.

429

28 Jun 2023

Paper
Code

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning

claws-lab/multimodal-robustness-xmai • • 19 Jun 2023

The robustness of multimodal deep learning models to realistic changes in the input text is critical for their applicability to important tasks such as text-to-image retrieval and cross-modal entailment.

19 Jun 2023

Paper
Code

Towards Balanced Active Learning for Multimodal Classification

MengShen0709/bmmal • • 14 Jun 2023

Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality.

14 Jun 2023

Paper
Code

Multimodal Neural Databases

giovannitra/multimodalneuraldatabases • • 2 May 2023

The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them.

02 May 2023

Paper
Code

Building Multimodal AI Chatbots

minniie/multimodal_chat • • 21 Apr 2023

Therefore, this work proposes a complete chatbot system using two multimodal deep learning models: an image retriever that understands texts and a response generator that understands images.

21 Apr 2023

Paper
Code

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Lightning-AI/lit-llama • • 28 Mar 2023

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.

5,801

28 Mar 2023

Paper
Code

Zorro: the masked multimodal transformer

lucidrains/zorro-pytorch • • 23 Jan 2023

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering.

23 Jan 2023

Paper
Code

Multimodal Deep Learning

Benchmarks Add a Result

Datasets

Subtasks

Latest papers

Content

Benchmarks

Add a Result