Multimodal Machine Translation

34 papers with code • 3 benchmarks • 5 datasets

Multimodal machine translation is the task of performing machine translation with multiple data sources - for example, translating the sentence "a bird is flying over water" together with an image of a bird over water into German text.

(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
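The sketch below illustrates one common way to wire this up: a standard Transformer translation model whose encoder also sees a projected image feature alongside the source token embeddings. This is a minimal illustration, not any specific published model; the class name, dimensions, and the assumption of a precomputed global image feature (e.g. from a CNN or ViT backbone) are all choices made for the example.

```python
import torch
import torch.nn as nn


class SimpleMultimodalTranslator(nn.Module):
    """Minimal MMT sketch: fuse a global image feature into a text-to-text
    Transformer by prepending it to the source token embeddings."""

    def __init__(self, src_vocab, tgt_vocab, d_model=256, img_dim=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Project a precomputed image feature into the token embedding space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, img_feat, tgt_ids):
        # src_ids: (B, S), img_feat: (B, img_dim), tgt_ids: (B, T)
        src = self.src_embed(src_ids)                  # (B, S, d_model)
        img = self.img_proj(img_feat).unsqueeze(1)     # (B, 1, d_model)
        src = torch.cat([img, src], dim=1)             # visual "token" first
        tgt = self.tgt_embed(tgt_ids)
        # Causal mask so the decoder cannot look at future target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                        # (B, T, tgt_vocab)


# Toy usage with random tensors.
model = SimpleMultimodalTranslator(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)),   # source token ids
               torch.randn(2, 2048),              # image features
               torch.randint(0, 1000, (2, 5)))    # target token ids
print(logits.shape)  # torch.Size([2, 5, 1000])
```

In practice, models in this area differ mainly in how the fusion happens (concatenation, separate cross-attention over region features, gating), but the interface above - source text plus image features in, target logits out - is the common shape of the task.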

Latest papers with no code

Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets

no code yet • 9 Apr 2024

Recent research in the field of multimodal machine translation (MMT) has indicated that the visual modality is either dispensable or offers only marginal advantages.

Detecting Concrete Visual Tokens for Multimodal Machine Translation

no code yet • 5 Mar 2024

The challenge of visual grounding and masking in multimodal machine translation (MMT) systems has encouraged varying approaches to the detection and selection of visually-grounded text tokens for masking.

Adding Multimodal Capabilities to a Text-only Translation Model

no code yet • 5 Mar 2024

While most current work in multimodal machine translation (MMT) uses the Multi30k dataset for training and evaluation, we find that the resulting models overfit to the Multi30k dataset to an extreme degree.

The Case for Evaluating Multimodal Translation Models on Text Datasets

no code yet • 5 Mar 2024

Therefore, we propose that MMT models be evaluated using 1) the CoMMuTE evaluation framework, which measures the use of visual information by MMT models, 2) the text-only WMT news translation task test sets, which evaluate translation performance against complex sentences, and 3) the Multi30k test sets, for measuring MMT model performance against a real MMT dataset.
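The core of a CoMMuTE-style contrastive check can be reduced to a few lines: given an ambiguous source sentence, a disambiguating image, and a correct/incorrect translation pair, the model should assign the correct translation the higher score. The snippet below is a rough sketch only; `score` is a hypothetical hook (for example, the model's log-probability of the target given the source and image), not part of any released CoMMuTE tooling.

```python
def contrastive_accuracy(examples, score):
    """Fraction of examples where the matching translation outscores the
    mismatched one when the model is shown the disambiguating image."""
    correct = 0
    for ex in examples:
        # ex: {"src": str, "image": ..., "good_tgt": str, "bad_tgt": str}
        good = score(ex["src"], ex["image"], ex["good_tgt"])
        bad = score(ex["src"], ex["image"], ex["bad_tgt"])
        correct += int(good > bad)
    return correct / len(examples)
```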

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

no code yet • 12 Jun 2023

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets.

HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

no code yet • 28 May 2023

This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language.

Iterative Adversarial Attack on Image-guided Story Ending Generation

no code yet • 16 May 2023

Multimodal learning involves developing models that can integrate information from various sources like images and texts.

Generalization algorithm of multimodal pre-training model based on graph-text self-supervised training

no code yet • 16 Feb 2023

In this paper, a multimodal pre-training generalization algorithm for self-supervised training is proposed, which overcomes the lack and inaccuracy of visual information and thus extends the applicability of images to NMT.

Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation

no code yet • 20 Dec 2022

Therefore, this paper establishes new methods and new datasets for MMT.

ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation

no code yet • 9 Nov 2022

Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance.