Multimodal Deep Learning

67 papers with code • 1 benchmark • 17 datasets

Multimodal deep learning is a form of deep learning that combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data that spans several types of input and making predictions from the combined representation.

One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done with a variety of techniques, such as fusing the features extracted from each modality or using attention mechanisms to weight each modality's contribution according to its importance for the task at hand.
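
As a concrete illustration, here is a minimal PyTorch sketch of those two strategies: plain feature-level fusion by concatenation, and attention-style weighting of each modality before combining. All module names, dimensions, and the random stand-in features are illustrative and not taken from any particular paper.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate per-modality features, then classify (simple feature fusion)."""
    def __init__(self, text_dim, image_dim, audio_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(text_dim + image_dim + audio_dim, num_classes)

    def forward(self, text_feat, image_feat, audio_feat):
        fused = torch.cat([text_feat, image_feat, audio_feat], dim=-1)
        return self.head(fused)

class AttentionFusionClassifier(nn.Module):
    """Weight each modality by a learned attention score before summing."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # one scalar score per modality
        self.head = nn.Linear(dim, num_classes)

    def forward(self, text_feat, image_feat, audio_feat):
        feats = torch.stack([text_feat, image_feat, audio_feat], dim=1)  # (B, 3, D)
        weights = torch.softmax(self.score(feats), dim=1)                # (B, 3, 1)
        fused = (weights * feats).sum(dim=1)                             # (B, D)
        return self.head(fused)

# Random features stand in for the outputs of per-modality encoders.
text, image, audio = (torch.randn(4, 256) for _ in range(3))
logits = AttentionFusionClassifier(dim=256, num_classes=10)(text, image, audio)
```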

Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous vehicles. By combining information from multiple modalities, multimodal deep learning can improve the accuracy and robustness of models, enabling them to perform better in real-world scenarios where multiple types of information are present.

Zorro: the masked multimodal transformer

lucidrains/zorro-pytorch 23 Jan 2023

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, requiring very little fusion engineering.

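A minimal sketch of the single-backbone idea described in the excerpt: project each modality to a common token dimension, concatenate along the sequence axis, and run one shared Transformer encoder. This shows only the generic setup, not Zorro's modality-masking scheme; all shapes are illustrative.

```python
import torch
import torch.nn as nn

dim = 512
# Shared backbone: a single Transformer encoder processes all modalities.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6,
)

# Per-modality projections into the shared token space (illustrative shapes).
video_tokens = nn.Linear(1024, dim)(torch.randn(2, 32, 1024))  # 32 video patches
audio_tokens = nn.Linear(128, dim)(torch.randn(2, 48, 128))    # 48 audio frames

# Fusion is just concatenation along the sequence dimension.
tokens = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, 80, dim)
fused = backbone(tokens)                                  # contextualized across modalities
```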

Multimodal Deep Learning

slds-lmu/seminar_multimodal_dl 12 Jan 2023

This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually.

Learning Semantic Relationship Among Instances for Image-Text Matching

CrossmodalGroup/HREM CVPR 2023 (01 Jan 2023)

Image-text matching, a bridge connecting image and language, is an important task that generally learns a holistic cross-modal embedding to achieve high-quality semantic alignment between the two modalities.

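A minimal sketch of the holistic cross-modal embedding setup the excerpt refers to: images and texts are embedded into a shared space, matches are scored by cosine similarity, and a symmetric contrastive loss pulls matched pairs together. This is the generic formulation only, not HREM's instance-relationship modelling.

```python
import torch
import torch.nn.functional as F

def matching_scores(image_feats, text_feats):
    """Cosine-similarity matrix between all image/text pairs in a batch."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return img @ txt.t()  # (num_images, num_texts)

def contrastive_loss(scores, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched pairs lie on the diagonal."""
    targets = torch.arange(scores.size(0), device=scores.device)
    logits = scores / temperature
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for encoder outputs (e.g. from an image encoder and a text encoder).
image_feats, text_feats = torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive_loss(matching_scores(image_feats, text_feats))
```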

Learning Multimodal Data Augmentation in Feature Space

lzcemma/lemda 29 Dec 2022

The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems.

Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis

firasgit/lsmt 18 Dec 2022

Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data.

aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception

aimotive/aimotive_dataset 17 Nov 2022

The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view.

Bayesian Prompt Learning for Image-Language Model Generalization

saic-fi/bayesian-prompt-learning ICCV 2023 (05 Oct 2022)

Our approach regularizes the prompt space, reduces overfitting to seen prompts, and improves prompt generalization to unseen prompts.

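A minimal sketch of the soft-prompting setup that this work (and the next entry) builds on: a small set of learnable context vectors is prepended to frozen class-name embeddings and optimized while the vision-language model itself stays frozen. The Bayesian treatment of the prompt space proposed in the paper is not shown; shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable context vectors prepended to frozen class-name embeddings."""
    def __init__(self, n_ctx=16, embed_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_token_embeds):
        # class_token_embeds: (num_classes, n_name_tokens, embed_dim), kept frozen
        n_cls = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # The resulting prompt sequence would be fed to the frozen text encoder
        # of a CLIP-style model; only self.ctx receives gradients.
        return torch.cat([ctx, class_token_embeds], dim=1)

prompts = SoftPrompt()(torch.randn(10, 4, 512))  # (10 classes, 16 + 4 tokens, 512)
```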

LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

1adrianb/lasp CVPR 2023 (03 Oct 2022)

Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

lupantech/ScienceQA 20 Sep 2022

We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.

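A minimal sketch of what a training target with a lecture-and-explanation chain of thought can look like: the model is asked to produce the answer first, then the reasoning. The exact template and field order used by the paper may differ; the example content below is invented for illustration.

```python
def build_cot_example(question, context, choices, answer, lecture, explanation):
    """Format one science question so the target includes the reasoning chain."""
    options = " ".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"Question: {question}\nContext: {context}\nOptions: {options}\nAnswer:"
    # The model learns to produce the answer first, then the lecture and
    # explanation as its chain of thought.
    target = f" The answer is {answer}. BECAUSE: {lecture} {explanation}"
    return prompt, target

prompt, target = build_cot_example(
    question="Which property do these objects have in common?",
    context="Each object is made of rubber.",
    choices=["hard", "stretchy", "opaque"],
    answer="(B)",
    lecture="Materials such as rubber can stretch and return to their shape.",
    explanation="A rubber band and a balloon are both stretchy.",
)
```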

LAVIS: A Library for Language-Vision Intelligence

salesforce/lavis 15 Sep 2022

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.

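As a usage note, LAVIS exposes its pretrained language-vision models through a unified loading interface; the sketch below follows the usage pattern shown in the library's documentation (available model and checkpoint names may differ across releases).

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pretrained captioning model plus its matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
captions = model.generate({"image": image})
```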