Multimodal Deep Learning

66 papers with code • 1 benchmark • 17 datasets

Multimodal deep learning combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data that spans several types of input and making predictions from the combined representation.

A key challenge in multimodal deep learning is how to combine information from multiple modalities effectively. Common approaches include fusing the features extracted from each modality or using attention mechanisms to weight each modality's contribution according to its importance for the task at hand.
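
As a concrete illustration, the sketch below shows one simple way to implement attention-weighted fusion in PyTorch: each modality's features are projected into a shared space, scored by a small attention layer, and combined as a weighted sum. The feature dimensions, modality choices, and classifier head are illustrative assumptions, not a reference implementation from any of the papers listed here.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dims, fused_dim=256, num_classes=10):
        super().__init__()
        # Project each modality's features into a shared space.
        self.projections = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # One scalar attention score per modality.
        self.score = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, features):
        # features: list of per-modality tensors, each of shape (batch, dim_i)
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)], dim=1
        )  # (batch, num_modalities, fused_dim)
        weights = torch.softmax(self.score(projected), dim=1)  # (batch, M, 1)
        fused = (weights * projected).sum(dim=1)               # (batch, fused_dim)
        return self.classifier(fused)

# Example: fuse hypothetical image (2048-d) and text (768-d) features.
model = AttentionFusion(dims=[2048, 768])
logits = model([torch.randn(4, 2048), torch.randn(4, 768)])

Feature-level (early or intermediate) fusion like this is only one option; simpler baselines concatenate the raw features, while late-fusion systems combine per-modality predictions instead.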

Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous vehicles. By combining information from multiple modalities, multimodal deep learning can improve the accuracy and robustness of models, enabling them to perform better in real-world scenarios where multiple types of information are present.

Most implemented papers

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

opengvlab/llama-adapter 28 Mar 2023

We present LLaMA-Adapter, a lightweight adaptation method to efficiently fine-tune LLaMA into an instruction-following model.

ShapeWorld - A new test methodology for multimodal language understanding

AlexKuhnle/ShapeWorld 14 Apr 2017

We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities.

Multimodal deep networks for text and image-based document classification

Quicksign/ocrized-text-dataset 15 Jul 2019

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures.

Are These Birds Similar: Learning Branched Networks for Fine-grained Representations

nicolalandro/ntsnet-cub200 16 Jan 2020

In recent years, natural language descriptions have been used to obtain information on discriminative parts of the object.

aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception

aimotive/aimotive_dataset 17 Nov 2022

The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view.

Multimodal Deep Learning for Robust RGB-D Object Recognition

isrugeek/robotics_recognition 24 Jul 2015

Robust object recognition is a crucial ingredient of many, if not all, real-world robotics applications.

Supervised Video Summarization via Multiple Feature Sets with Parallel Attention

TIBHannover/MSVA 23 Apr 2021

The proposed architecture utilizes an attention mechanism before fusing motion features and features representing the (static) visual content, i.e., derived from an image classification model.
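
For readers unfamiliar with this pattern, the rough sketch below applies attention to each feature stream (motion and static appearance) separately before fusing them and scoring frame importance. The dimensions, attention type, and concatenation-based fusion are assumptions made for illustration; this is not the MSVA implementation.

import torch
import torch.nn as nn

class ParallelAttentionFusion(nn.Module):
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.motion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.static_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scorer = nn.Linear(2 * dim, 1)  # per-frame importance score

    def forward(self, motion_feats, static_feats):
        # Each input: (batch, num_frames, dim); self-attention within each stream.
        m, _ = self.motion_attn(motion_feats, motion_feats, motion_feats)
        s, _ = self.static_attn(static_feats, static_feats, static_feats)
        fused = torch.cat([m, s], dim=-1)      # simple concatenation fusion
        return self.scorer(fused).squeeze(-1)  # (batch, num_frames) scores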

A multimodal deep learning framework for scalable content based visual media retrieval

ambareeshravi/media_retrieval 18 May 2021

We propose a novel, efficient, modular, and scalable framework for content-based visual media retrieval that leverages deep learning and works for both images and videos; we also introduce an efficient comparison and filtering metric for retrieval.

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

declare-lab/multimodal-deep-learning 28 Jul 2021

Multimodal sentiment analysis aims to extract and integrate semantic information collected from multiple modalities to recognize the expressed emotions and sentiment in multimodal data.

HyMNet: a Multimodal Deep Learning System for Hypertension Classification using Fundus Photographs and Cardiometabolic Risk Factors

mohammedsb/hymnet 2 Oct 2023

Our MMDL system uses RETFound, a foundation model pre-trained on 1.6 million retinal images, for the fundus path and a fully connected neural network for the age and gender path.
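
The following hedged sketch shows what such a two-path design can look like in PyTorch: an image backbone (here a placeholder ResNet standing in for a retinal foundation model such as RETFound) and a small fully connected network for tabular risk factors, with the two embeddings concatenated before classification. The backbone choice and layer sizes are assumptions, not HyMNet's actual code.

import torch
import torch.nn as nn
from torchvision import models

class TwoPathClassifier(nn.Module):
    def __init__(self, tabular_dim=2, num_classes=2):
        super().__init__()
        # Placeholder image backbone standing in for a retinal foundation model.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # expose 512-d image embeddings
        self.image_path = backbone
        self.tabular_path = nn.Sequential(
            nn.Linear(tabular_dim, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU()
        )
        self.head = nn.Linear(512 + 32, num_classes)

    def forward(self, image, tabular):
        z = torch.cat([self.image_path(image), self.tabular_path(tabular)], dim=1)
        return self.head(z)

# Example: a batch of two fundus images with age and gender as tabular inputs.
model = TwoPathClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 2))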