Multimodal Machine Translation
34 papers with code • 3 benchmarks • 5 datasets
Multimodal machine translation is the task of performing machine translation with multiple data sources, for example translating "a bird is flying over water" together with an image of a bird over water into German text.
(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
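To make the input-output contract concrete, here is a minimal PyTorch sketch of one common fusion recipe: a global image feature is projected into the text embedding space and prepended to the source tokens as an extra "visual token". The sizes, the concatenation-based fusion, and the `ToyMultimodalEncoder` name are illustrative assumptions, not any particular published model.

```python
import torch
import torch.nn as nn

class ToyMultimodalEncoder(nn.Module):
    """Hypothetical encoder: fuses one pooled image feature with source tokens."""
    def __init__(self, vocab_size=1000, d_model=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)  # map image feature into text space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, src_tokens, img_feat):
        tok = self.embed(src_tokens)                # (B, T, d_model)
        img = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, d_model) "visual token"
        return self.encoder(torch.cat([img, tok], dim=1))  # joint representation

enc = ToyMultimodalEncoder()
src = torch.randint(0, 1000, (2, 7))  # e.g. "a bird is flying over water ."
img = torch.randn(2, 2048)            # pooled CNN feature of the paired image
print(enc(src, img).shape)            # torch.Size([2, 8, 256])
```

A translation decoder would then attend over this joint sequence to produce the target-language text.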
Latest papers
Neural Machine Translation with Phrase-Level Universal Visual Representations
Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from the shortage of sentence-image pairs.
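A hedged sketch of the retrieval idea the title suggests: visual vectors are pre-computed per phrase from an external image-caption corpus and looked up for the source sentence at translation time, so no paired image is needed. The table contents, the n-gram matching, and the averaging scheme below are toy assumptions.

```python
import numpy as np

# assumed pre-built table: phrase -> mean visual feature over images whose
# captions contain that phrase (random stand-ins here, not real features)
visual_table = {
    "a bird": np.random.randn(2048),
    "flying over": np.random.randn(2048),
    "water": np.random.randn(2048),
}

def retrieve_phrase_visuals(sentence, table, max_n=2):
    """Collect visual vectors for every known n-gram in the sentence."""
    words = sentence.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i : i + n])
            if phrase in table:
                feats.append(table[phrase])
    # fall back to a zero vector if no phrase matched
    return np.mean(feats, axis=0) if feats else np.zeros(2048)

vis = retrieve_phrase_visuals("A bird is flying over water", visual_table)
print(vis.shape)  # (2048,) -- usable wherever an MMT model expects an image feature
```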
On Vision Features in Multimodal Machine Translation
Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of the vision models themselves.
MSCTD: A Multimodal Sentiment Chat Translation Dataset
In this work, we introduce a new task named Multimodal Chat Translation (MCT), aiming to generate more accurate translations with the help of the associated dialogue history and visual context.
VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information less effective for generating appropriate translations.
Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models
Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.
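One way to make such a claim testable is an incongruence check: if a model truly uses the image, its score should drop when each sentence is paired with a mismatched image. The sketch below is a generic version of that idea; `score_fn` is a hypothetical stand-in for any sentence-level quality score from a real MMT system.

```python
import random

def congruence_gap(pairs, score_fn, seed=0):
    """pairs: list of (sentence, image_feature). Returns the average score
    difference between matched and shuffled pairings; a gap near zero
    suggests the model is ignoring the image."""
    matched = sum(score_fn(s, img) for s, img in pairs) / len(pairs)
    images = [img for _, img in pairs]
    random.Random(seed).shuffle(images)  # break the sentence-image alignment
    shuffled = sum(score_fn(s, img) for (s, _), img in zip(pairs, images)) / len(pairs)
    return matched - shuffled

# toy check with a dummy scorer that ignores the image entirely
gap = congruence_gap([("a bird", 1), ("a dog", 2)], lambda s, img: 0.5)
print(gap)  # 0.0 -- this "model" is insensitive to the visual input
```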
BERTGEN: Multi-task Generation through BERT
We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing multimodal and multilingual pretrained models VL-BERT and M-BERT, respectively.
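The snippet below sketches one plausible reading of BERT-style decoder-only generation: append a [MASK], predict it, commit the token, repeat. It uses plain bert-base-uncased from the Hugging Face transformers library as a stand-in for the fused VL-BERT/M-BERT model of the paper, so the continuation is only illustrative.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "a bird is flying over"
ids = tok(text, return_tensors="pt")["input_ids"][0, :-1]  # drop trailing [SEP]
for _ in range(3):  # generate three tokens, one [MASK] prediction at a time
    inp = torch.cat([ids, torch.tensor([tok.mask_token_id])]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids=inp).logits
    next_id = logits[0, -1].argmax()          # greedy choice for the [MASK] slot
    ids = torch.cat([ids, next_id.unsqueeze(0)])
print(tok.decode(ids))
```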
ViTA: Visual-Linguistic Translation by Aligning Object Tags
Multimodal Machine Translation (MMT) enriches the source text with visual information for translation.
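A minimal sketch of the object-tag idea: instead of feeding raw image features, detected object labels are appended to the source sentence, and an ordinary text-only NMT model translates the enriched input. The detector output and the separator token here are illustrative assumptions.

```python
def enrich_with_tags(src_sentence, object_tags, sep="[SEP]"):
    """Append detected object tags to the source sentence as extra context."""
    return f"{src_sentence} {sep} {' '.join(object_tags)}"

enriched = enrich_with_tags("a bird is flying over water", ["bird", "water", "sky"])
print(enriched)  # "a bird is flying over water [SEP] bird water sky"
# the enriched string can now be fed to any text-only translation model
```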
Cultural and Geographical Influences on Image Translatability of Words across Languages
We find that images of words are not always invariant across languages, and that language pairs with a shared culture (a common language family, ethnicity, or religion) show better image translatability (i.e., more similar images for similar words) than pairs without one, regardless of geographic proximity.
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Pre-trained language models have been shown to improve performance in many natural language tasks substantially.
Dynamic Context-guided Capsule Network for Multimodal Machine Translation
In particular, we represent the input image with global and regional visual features and introduce two parallel DCCNs to model multimodal context vectors from visual features at different granularities.
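A simplified sketch of context-guided multimodal fusion: the decoder's current hidden state acts as a query over regional image features, and the result is combined with the global feature. The paper's dynamic capsule routing is replaced here by plain dot-product attention, so this only illustrates the two-granularity design, not the DCCN itself.

```python
import torch
import torch.nn.functional as F

def context_guided_fusion(dec_state, global_feat, regional_feats):
    """dec_state: (B, d); global_feat: (B, d); regional_feats: (B, R, d)."""
    # attention weights of the decoder state over R regional features
    attn = F.softmax(torch.bmm(regional_feats, dec_state.unsqueeze(2)), dim=1)
    regional_ctx = (attn * regional_feats).sum(dim=1)      # (B, d)
    return torch.cat([global_feat, regional_ctx], dim=-1)  # (B, 2d) context

B, R, d = 2, 36, 256
ctx = context_guided_fusion(torch.randn(B, d), torch.randn(B, d),
                            torch.randn(B, R, d))
print(ctx.shape)  # torch.Size([2, 512])
```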