Multimodal Deep Learning
67 papers with code • 1 benchmark • 17 datasets
Multimodal deep learning combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data that spans several modalities and using the network to make predictions from the combined input.
One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done with a variety of techniques, such as fusing the features extracted from each modality or using attention mechanisms to weight each modality's contribution according to its importance for the task at hand.
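To make the fusion idea concrete, here is a minimal PyTorch sketch of attention-weighted feature fusion; the module, dimensions, and toy inputs are illustrative rather than taken from any specific paper:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with learned attention weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each modality's features

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(features, dim=1)               # (batch, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (batch, M, 1)
        return (weights * stacked).sum(dim=1)                # (batch, dim)

# Toy usage: fuse 512-d image and text embeddings before a prediction head.
image_feat = torch.randn(8, 512)  # e.g. from a vision encoder
text_feat = torch.randn(8, 512)   # e.g. from a text encoder
fused = AttentionFusion(dim=512)([image_feat, text_feat])
print(fused.shape)  # torch.Size([8, 512])
```

A simpler alternative is plain concatenation followed by a linear layer; the learned attention weights here just let the model down-weight a modality that is uninformative for a given input.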
Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous driving. By combining information from multiple modalities, models become more accurate and robust, which helps them perform well in real-world scenarios where several types of information are present at once.
Latest papers
Zorro: the masked multimodal transformer
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering.
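The snippet below illustrates that concatenation idea with a stock PyTorch transformer; it omits Zorro's masking scheme, which is what keeps unimodal and fused representations separate in the actual paper:

```python
import torch
import torch.nn as nn

dim = 256
# One shared backbone; no per-modality fusion modules are needed.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)

audio_tokens = torch.randn(2, 50, dim)   # (batch, audio length, dim)
video_tokens = torch.randn(2, 196, dim)  # (batch, video length, dim)
tokens = torch.cat([audio_tokens, video_tokens], dim=1)  # one joint sequence
out = backbone(tokens)  # self-attention mixes the modalities
print(out.shape)  # torch.Size([2, 246, 256])
```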
Multimodal Deep Learning
This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of deep learning (natural language processing and computer vision) individually.
Learning Semantic Relationship Among Instances for Image-Text Matching
Image-text matching, a bridge connecting images and language, is an important task in which a model generally learns a holistic cross-modal embedding to achieve high-quality semantic alignment between the two modalities.
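A common way to learn such an embedding is a CLIP-style contrastive objective over matched pairs; the sketch below shows that generic recipe, not the instance-relationship modeling this paper proposes:

```python
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs; in practice these come from trained
# image and text encoders projected into a shared embedding space.
image_emb = F.normalize(torch.randn(16, 512), dim=-1)
text_emb = F.normalize(torch.randn(16, 512), dim=-1)

logits = image_emb @ text_emb.t() / 0.07  # cosine similarities / temperature
targets = torch.arange(16)                # matched pairs sit on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```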
Learning Multimodal Data Augmentation in Feature Space
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems.
Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data.
aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception
The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view.
Bayesian Prompt Learning for Image-Language Model Generalization
Our approach regularizes the prompt space, reduces overfitting to seen prompts, and improves generalization to unseen prompts.
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
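Both of the papers above build on soft prompting, where a few learnable context vectors replace a hand-written prompt such as "a photo of a". The sketch below shows that basic setup with stand-in dimensions; it includes neither LASP's text-to-text objective nor the Bayesian regularization:

```python
import torch
import torch.nn as nn

ctx_len, dim, num_classes = 4, 512, 10
context = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)  # learned context
class_embs = torch.randn(num_classes, 1, dim)  # frozen class-name embeddings

# Prepend the shared learnable context to every class-name embedding; the
# result is fed to a frozen text encoder (e.g. CLIP's), and only `context`
# receives gradients during training.
prompts = torch.cat(
    [context.unsqueeze(0).expand(num_classes, -1, -1), class_embs], dim=1
)
print(prompts.shape)  # torch.Size([10, 5, 512])
```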
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
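For intuition, a training target in this setup pairs the answer with a generated lecture and explanation. The toy example below is illustrative only; the question content and field layout are placeholders, not the paper's exact template:

```python
# Input: question, context, and options; target: answer plus lecture and
# explanation, which together serve as the chain of thought. Made-up content.
prompt = (
    "Question: Which property do these objects have in common?\n"
    "Context: The image shows a rubber band, a balloon, and a sock.\n"
    "Options: (A) hard (B) stretchy (C) fragile"
)
target = (
    "Answer: (B) stretchy\n"
    "Lecture: A property is something you can observe about an object...\n"
    "Explanation: All three objects get longer when pulled, so they share "
    "the property of being stretchy."
)
```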
LAVIS: A Library for Language-Vision Intelligence
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.
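As a quick taste of the library, the snippet below follows LAVIS's documented quickstart pattern for image captioning; the available model names depend on the installed version, and the image path is a placeholder:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pretrained BLIP captioning model plus its matching preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ["a photo of ..."]
```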