Multimodal Deep Learning
66 papers with code • 1 benchmark • 17 datasets
Multimodal deep learning combines information from multiple modalities, such as text, images, audio, and video, to make more accurate and comprehensive predictions. Deep neural networks are trained on data that spans several of these modalities and learn to make predictions from the combined signal.
One of the key challenges in multimodal deep learning is how to effectively combine information from multiple modalities. This can be done with a variety of techniques, such as fusing the features extracted from each modality or using attention mechanisms to weight the contribution of each modality according to its importance for the task at hand.
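As a rough illustration of these two fusion strategies, here is a minimal PyTorch sketch. It is not taken from any of the papers listed below; the module names, feature dimensions, and two-modality (text + image) setup are illustrative assumptions.

```python
# Minimal sketch of two common multimodal fusion strategies:
# (1) concatenating per-modality features, (2) attention-weighted fusion.
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Fuse modalities by concatenating their feature vectors."""
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # (batch, text_dim + image_dim)
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)


class AttentionFusion(nn.Module):
    """Fuse modalities with a learned softmax weight per modality."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)        # scores each modality's features
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Stack modalities: (batch, num_modalities, feat_dim)
        feats = torch.stack([text_feat, image_feat], dim=1)
        weights = torch.softmax(self.scorer(feats), dim=1)   # (batch, num_modalities, 1)
        fused = (weights * feats).sum(dim=1)                 # weighted sum over modalities
        return self.classifier(fused)


# Example usage with random tensors standing in for encoder outputs
text_feat = torch.randn(8, 256)
image_feat = torch.randn(8, 256)
logits = AttentionFusion(feat_dim=256, num_classes=10)(text_feat, image_feat)
```

In practice the text and image features would come from pretrained encoders (e.g. a language model and a vision backbone), and the attention variant lets the network down-weight a modality that is missing or uninformative for a given example.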
Multimodal deep learning has many applications, including image captioning, speech recognition, natural language processing, and autonomous vehicles. By combining information from multiple modalities, multimodal deep learning can improve the accuracy and robustness of models, enabling them to perform better in real-world scenarios where multiple types of information are present.
Latest papers
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images.
MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness.
Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach
Cultural heritage serves as the enduring record of human thought and history.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g. LLaVA and OpenFlamingo.
InstructIR: High-Quality Image Restoration Following Human Instructions
All-In-One image restoration models can effectively restore images from various types and levels of degradation using degradation-specific information as prompts to guide the restoration model.
Uncertainty-Aware Hardware Trojan Detection Using Multimodal Deep Learning
The risk of hardware Trojans being inserted at various stages of chip production has increased in a zero-trust fabless era.
Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge
This work presents an enhanced approach to generating scene graphs by incorporating a relationship hierarchy and commonsense knowledge.
Dynamic Task and Weight Prioritization Curriculum Learning for Multimodal Imagery
Our primary objective is to develop a curriculum-trained multimodal deep learning model, with a particular focus on visual question answering (VQA), capable of jointly processing image and text data, in conjunction with semantic segmentation for disaster analytics using the FloodNet dataset (https://github.com/BinaLab/FloodNet-Challenge-EARTHVISION2021).
HyMNet: a Multimodal Deep Learning System for Hypertension Classification using Fundus Photographs and Cardiometabolic Risk Factors
Our MMDL system uses RETFound, a foundation model pre-trained on 1.6 million retinal images, for the fundus path and a fully connected neural network for the age and gender path.