Multimodal Generation

21 papers with code • 1 benchmark • 2 datasets

Multimodal generation is the task of generating outputs that span more than one modality, such as images, text, and sound. It is typically done with deep learning models trained on data that pairs several modalities, so that the generated output is informed by more than one type of data.

For example, a multimodal generation model could be trained to caption images. The model learns to identify the objects in an image and describe them in natural language, while also taking into account context and the relationships between the objects.

Other applications include generating realistic images from textual descriptions and generating audio descriptions of video content. Because these models draw on several modalities at once, their output can reflect information that no single modality carries on its own, which makes them useful across a wide range of applications.

Benchmarks

These leaderboards are used to track progress in multimodal generation.

Dataset	Best Model
Multi-Modal CelebA-HQ	Diffusion

Most implemented papers

Finite Scalar Quantization: VQ-VAE Made Simple

google-research/google-research • 27 Sep 2023

Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets.
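
The quantization described above can be illustrated with a minimal sketch. It is a simplified illustration rather than the paper's implementation: the tanh-based bounding, the function names, and the example values in `levels` and `z` are all assumptions made here, and the gradient handling needed for training is omitted.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Quantize each dimension of z to one of levels[i] fixed integer values.

    Assumption: values are squashed with tanh and rescaled before rounding.
    This bounding choice is illustrative, not taken from the source.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        # Map the real value into (0, 1), scale to [0, num_levels - 1], round.
        unit = (math.tanh(value) + 1.0) / 2.0
        codes.append(int(round(unit * (num_levels - 1))))
    return codes


def codebook_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Combine per-dimension codes into one index (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def codebook_size(levels: Sequence[int]) -> int:
    """Size of the implicit codebook: the product of the per-dimension sets."""
    return math.prod(levels)


levels = [8, 5, 5, 5]          # hypothetical number of values per dimension
z = [0.3, -2.0, 0.0, 1.5]      # hypothetical encoder output
codes = fsq_quantize(z, levels)
print(codes, codebook_index(codes, levels), codebook_size(levels))
```

No codebook is stored or learned in this sketch: every combination of per-dimension values is a valid code, so the codebook size is simply 8 * 5 * 5 * 5 = 1000.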

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)

mchong6/GANsNRoses • 11 Jun 2021

This adversarial loss guarantees the map is diverse: a very wide range of anime can be produced from a single content code.

Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

zzw-zwzhang/awesome-of-multimodal-dialogue-models • 10 Mar 2023

This stream is subsequently fed into the decoder-based transformer to generate visual re-creations and textual feedback in the second stage.

Retrieval-Augmented Generation for AI-Generated Content: A Survey

hymie122/rag-survey • 29 Feb 2024

We first classify RAG foundations according to how the retriever augments the generator, distilling the fundamental abstractions of the augmentation methodologies for various retrievers and generators.

Continual and Multi-Task Architecture Search

ramakanth-pasunuru/CAS-MAS • ACL 2019

Architecture search is the process of automatically learning the neural model or cell structure that best suits the given task.

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

ttumyche/mxq-vae • 15 Apr 2022

To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy.
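
The summary above does not say how the input masking is done. As a rough, generic illustration of random input masking over a joint sequence of quantized image and text tokens, one could write something like the following. `MASK_ID`, `mask_tokens`, the mask ratio, and the token values are hypothetical and are not taken from the paper or its code.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical placeholder id standing in for a [MASK] token


def mask_tokens(
    tokens: List[int], mask_ratio: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked sequence and the sorted positions that were masked.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    num_masked = int(len(tokens) * mask_ratio)
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_ID
    return masked, positions


# Hypothetical discrete codes for one image-text pair.
image_codes = [17, 4, 92, 33]
text_codes = [7, 7, 15, 2]
joint = image_codes + text_codes

masked, positions = mask_tokens(joint, mask_ratio=0.25, rng=random.Random(0))
print(masked, positions)
```

In a setup like this, an encoder would see the masked sequence and be trained to make use of the remaining tokens from both modalities; how the actual model uses its masking is not stated here.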

Multimodal Generation of Novel Action Appearances for Synthetic-to-Real Recognition of Activities of Daily Living

zrrr1997/syn2real_dg • 3 Aug 2022

We tackle this challenge and introduce an activity domain generation framework which creates novel ADL appearances (novel domains) from different existing activity modalities (source domains) inferred from video training data.

Multimedia Generative Script Learning for Task Planning

EagleW/Multimedia-Generative-Script-Learning-for-Task-Planning • 25 Aug 2022

Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, which is an essential task to assist robots or humans in performing stereotypical activities.

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

mhh0318/unid3 • 27 Nov 2022

The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling the multi-modality signals.

Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models

Nithin-GK/UniteandConquer • CVPR 2023

We also introduce a novel reliability parameter that allows using different off-the-shelf diffusion models trained across various datasets during sampling time alone to guide it to the desired outcome satisfying multiple constraints.
