Multimodal Deep Learning
66 papers with code • 1 benchmark • 17 datasets
Multimodal deep learning is a type of deep learning that combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. It involves training deep neural networks on data spanning several types of information and using the combined representation to make predictions.
One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done using a variety of techniques, such as fusing the features extracted from each modality, or using attention mechanisms to weight the contribution of each modality based on its importance for the task at hand.
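The two fusion strategies mentioned above can be sketched in a few lines. The snippet below is a minimal illustration, not taken from any of the listed papers: it assumes each modality has already been encoded into a fixed-length feature vector, then shows (a) simple feature concatenation and (b) attention-style weighting, where a softmax over per-modality relevance scores determines each modality's contribution. The scores would normally be learned; here they are hypothetical constants.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def concat_fusion(features):
    """(a) Early/feature-level fusion: concatenate per-modality vectors."""
    return np.concatenate(features)

def attention_fusion(features, scores):
    """(b) Attention-style fusion: weighted sum of per-modality vectors,
    with weights given by a softmax over (normally learned) scores."""
    weights = softmax(np.asarray(scores, dtype=float))
    stacked = np.stack(features)        # shape: (n_modalities, dim)
    return weights @ stacked            # weighted sum -> shape (dim,)

# Toy encoded features for text, image, and audio (hypothetical values).
text  = np.array([1.0, 0.0, 0.0])
image = np.array([0.0, 1.0, 0.0])
audio = np.array([0.0, 0.0, 1.0])

concatenated = concat_fusion([text, image, audio])            # length 9
fused = attention_fusion([text, image, audio], [2.0, 1.0, 0.0])
```

With these toy one-hot features, the fused vector simply equals the attention weights, making it easy to see that the modality with the highest score dominates the combination.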
Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous vehicles. By combining information from multiple modalities, multimodal deep learning can improve the accuracy and robustness of models, enabling them to perform better in real-world scenarios where multiple types of information are present.
Most implemented papers
DeepSeek-VL: Towards Real-World Vision-Language Understanding
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for performing cross-modal fusion before features are learned from individual modalities, and (2) it extends the previously proposed cross-connections, which only transfer information between streams that process compatible data.
Learn to Combine Modalities in Multimodal Deep Learning
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches.
Multimodal Age and Gender Classification Using Ear and Profile Face Images
Experimental results indicated that profile face images contain a rich source of information for age and gender classification.
Audio-Conditioned U-Net for Position Estimation in Full Sheet Images
The goal of score following is to track a musical performance, usually in the form of audio, in a corresponding score representation.
Predicting the Leading Political Ideology of YouTube Channels Using Acoustic, Textual, and Metadata Information
Our analysis shows that using the acoustic signal improved bias detection by more than 6% absolute over using text and metadata only.
Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response
Multimedia content in social media platforms provides significant information during disaster events.
HYDRA: A multimodal deep learning framework for malware classification
While traditional machine learning methods for malware detection largely depend on hand-designed features, which are based on experts’ knowledge of the domain, end-to-end learning approaches take the raw executable as input, and try to learn a set of descriptive features from it.
Image Search With Text Feedback by Visiolinguistic Attention Learning
In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework.
More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification
In particular, we also investigate a special case of multi-modality learning (MML) -- cross-modality learning (CML) that exists widely in RS image classification applications.