Multimodal Deep Learning
67 papers with code • 1 benchmark • 17 datasets
Multimodal deep learning combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data that spans several modalities and using the network to make predictions from the combined input.
One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done with a variety of techniques, such as fusing the features extracted from each modality or using attention mechanisms to weight each modality's contribution according to its importance for the task at hand.
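To make the fusion idea concrete, here is a minimal PyTorch sketch of attention-weighted feature fusion; the module, dimensions, and toy inputs are illustrative rather than taken from any specific paper:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with learned attention weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each modality's features

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(features, dim=1)               # (batch, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (batch, M, 1)
        return (weights * stacked).sum(dim=1)                # (batch, dim)

# Toy usage: fuse 512-d image and text embeddings before a prediction head.
image_feat = torch.randn(8, 512)  # e.g. from a vision encoder
text_feat = torch.randn(8, 512)   # e.g. from a text encoder
fused = AttentionFusion(dim=512)([image_feat, text_feat])
print(fused.shape)  # torch.Size([8, 512])
```

A simpler alternative is plain concatenation followed by a linear layer; the learned attention weights here just let the model down-weight a modality that is uninformative for a given input.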
Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous driving. By combining information from multiple modalities, models become more accurate and robust, which helps them perform well in real-world scenarios where several types of information are present at once.
Latest papers
Zorro: the masked multimodal transformer
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering.
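The snippet below illustrates that concatenation idea with a stock PyTorch transformer; it omits Zorro's masking scheme, which is what keeps unimodal and fused representations separate in the actual paper:

```python
import torch
import torch.nn as nn

dim = 256
# One shared backbone; no per-modality fusion modules are needed.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=4,
)

audio_tokens = torch.randn(2, 50, dim)   # (batch, audio length, dim)
video_tokens = torch.randn(2, 196, dim)  # (batch, video length, dim)
tokens = torch.cat([audio_tokens, video_tokens], dim=1)  # one joint sequence
out = backbone(tokens)  # self-attention mixes the modalities
print(out.shape)  # torch.Size([2, 246, 256])
```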
Multimodal Deep Learning
This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of deep learning (natural language processing and computer vision) individually.
Learning Semantic Relationship Among Instances for Image-Text Matching
Image-text matching, a bridge connecting images and language, is an important task in which a model generally learns a holistic cross-modal embedding to achieve high-quality semantic alignment between the two modalities.
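A common way to learn such an embedding is a CLIP-style contrastive objective over matched pairs; the sketch below shows that generic recipe, not the instance-relationship modeling this paper proposes:

```python
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs; in practice these come from trained
# image and text encoders projected into a shared embedding space.
image_emb = F.normalize(torch.randn(16, 512), dim=-1)
text_emb = F.normalize(torch.randn(16, 512), dim=-1)

logits = image_emb @ text_emb.t() / 0.07  # cosine similarities / temperature
targets = torch.arange(16)                # matched pairs sit on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```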
Learning Multimodal Data Augmentation in Feature Space
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems.
Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data.
aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception
The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view.
Bayesian Prompt Learning for Image-Language Model Generalization
Our approach regularizes the prompt space, reduces overfitting to seen prompts, and improves generalization to unseen prompts.
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
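Both of the papers above build on soft prompting, where a few learnable context vectors replace a hand-written prompt such as "a photo of a". The sketch below shows that basic setup with stand-in dimensions; it includes neither LASP's text-to-text objective nor the Bayesian regularization:

```python
import torch
import torch.nn as nn

ctx_len, dim, num_classes = 4, 512, 10
context = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)  # learned context
class_embs = torch.randn(num_classes, 1, dim)  # frozen class-name embeddings

# Prepend the shared learnable context to every class-name embedding; the
# result is fed to a frozen text encoder (e.g. CLIP's), and only `context`
# receives gradients during training.
prompts = torch.cat(
    [context.unsqueeze(0).expand(num_classes, -1, -1), class_embs], dim=1
)
print(prompts.shape)  # torch.Size([10, 5, 512])
```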
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
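For intuition, a training target in this setup pairs the answer with a generated lecture and explanation. The toy example below is illustrative only; the question content and field layout are placeholders, not the paper's exact template:

```python
# Input: question, context, and options; target: answer plus lecture and
# explanation, which together serve as the chain of thought. Made-up content.
prompt = (
    "Question: Which property do these objects have in common?\n"
    "Context: The image shows a rubber band, a balloon, and a sock.\n"
    "Options: (A) hard (B) stretchy (C) fragile"
)
target = (
    "Answer: (B) stretchy\n"
    "Lecture: A property is something you can observe about an object...\n"
    "Explanation: All three objects get longer when pulled, so they share "
    "the property of being stretchy."
)
```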
LAVIS: A Library for Language-Vision Intelligence
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.
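As a quick taste of the library, the snippet below follows LAVIS's documented quickstart pattern for image captioning; the available model names depend on the installed version, and the image path is a placeholder:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pretrained BLIP captioning model plus its matching preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
raw_image = Image.open("photo.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ["a photo of ..."]
```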