Multimodal Machine Translation
34 papers with code • 3 benchmarks • 5 datasets
Multimodal machine translation is the task of performing machine translation with multiple data sources - for example, translating the sentence "a bird is flying over water" together with an image of a bird over water into German text.
(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
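To make the task concrete, here is a minimal sketch of how an MMT model might fuse the two input modalities before decoding. The module choices, dimensions, and concatenation-based fusion are illustrative assumptions for this sketch, not the design of any paper listed below; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyMMTEncoder(nn.Module):
    """Illustrative fusion of text and image features for MMT.
    All dimensions and module choices are assumptions for this
    sketch, not taken from any particular paper on this page."""

    def __init__(self, vocab_size=10000, d_model=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Project pre-extracted image features (e.g. from a CNN or
        # region detector) into the text encoder's space so the two
        # modalities can be attended to jointly.
        self.img_proj = nn.Linear(img_dim, d_model)

    def forward(self, src_tokens, img_feats):
        text = self.text_enc(self.embed(src_tokens))   # (B, T, d)
        image = self.img_proj(img_feats)               # (B, R, d)
        # Simple concatenation fusion; a decoder would cross-attend
        # over this joint sequence to produce the target translation.
        return torch.cat([text, image], dim=1)         # (B, T+R, d)

# Example: a batch of 2 sentences with 36 region features per image.
fused = ToyMMTEncoder()(torch.randint(0, 10000, (2, 12)),
                        torch.randn(2, 36, 2048))
print(fused.shape)  # torch.Size([2, 48, 256])
```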
Latest papers
Seamless: Multilingual Expressive and Streaming Speech Translation
In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
Video-Helpful Multimodal Machine Translation
In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation.
Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs
This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete.
Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation
Multimodal machine translation (MMT) simultaneously takes the source sentence and a relevant image as input for translation.
CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation
Simultaneously, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, which have shown exceptional generalisation ability.
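As a rough illustration of this transfer idea, the sketch below maps a CLIP-style image embedding into a soft prefix for a frozen multilingual NMT model. The dimensions, prefix length, and MLP mapper are assumptions for the sketch, not the exact CLIPTrans architecture.

```python
import torch
import torch.nn as nn

class MappingAdapter(nn.Module):
    """Hypothetical bridge between a frozen vision encoder and a frozen
    multilingual NMT model. The dimensions (512 for a CLIP-style image
    embedding, 1024 for an mBART-style encoder) and the prefix length
    are assumptions for this sketch."""

    def __init__(self, clip_dim=512, nmt_dim=1024, prefix_len=10):
        super().__init__()
        self.prefix_len, self.nmt_dim = prefix_len, nmt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, nmt_dim * prefix_len),
            nn.GELU(),
        )

    def forward(self, image_emb):             # (B, clip_dim)
        prefix = self.mlp(image_emb)          # (B, nmt_dim * prefix_len)
        # Reshape into prefix_len pseudo-tokens to prepend to the NMT
        # encoder's token embeddings; only this small adapter is
        # trained, while both large pre-trained models stay frozen.
        return prefix.view(-1, self.prefix_len, self.nmt_dim)

prefix = MappingAdapter()(torch.randn(4, 512))
print(prefix.shape)  # torch.Size([4, 10, 1024])
```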
BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation
We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous with the presence of ambiguous words, and Unambiguous in which the text context is self-contained for translation.
Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained with pairs of source text and images, and tested with only source-text inputs.
Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.
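A contrastive evaluation of this kind can be sketched as follows: given an ambiguous source sentence and an image, the model should assign higher probability to the translation the image supports. The `model(src, img, tgt_in)` interface and the `prefers_correct` helper below are hypothetical stand-ins, not a real library API.

```python
import torch
import torch.nn.functional as F

def prefers_correct(model, src, img, good_tgt, bad_tgt):
    """Return True if `model` scores the image-supported translation
    higher. Any MMT model returning per-token target logits would fit
    this (hypothetical) interface."""
    def score(tgt):
        logits = model(src, img, tgt[:, :-1])            # (1, T-1, V)
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, tgt[:, 1:, None]).sum()   # sequence log-prob
    return bool(score(good_tgt) > score(bad_tgt))

# Smoke test with a dummy model that ignores its inputs entirely.
class Dummy(torch.nn.Module):
    def forward(self, src, img, tgt_in):
        return torch.randn(1, tgt_in.size(1), 100)

print(prefers_correct(Dummy(), torch.zeros(1, 5, dtype=torch.long),
                      torch.randn(1, 2048),
                      torch.randint(0, 100, (1, 6)),
                      torch.randint(0, 100, (1, 6))))
```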
Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation
Thus, in this work, we introduce IKD-MMT, a novel MMT framework to support the image-free inference phase via an inversion knowledge distillation scheme.
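One way to picture the image-free idea: a small network learns to regress image features from the text representation, so the regressed features can stand in for the missing image at inference time. The architecture and loss below are illustrative assumptions, not the exact IKD-MMT scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHallucinator(nn.Module):
    """Illustrative text-to-image-feature regressor; sizes are
    assumptions for this sketch."""

    def __init__(self, d_text=256, d_img=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_text, 512), nn.ReLU(), nn.Linear(512, d_img)
        )

    def forward(self, text_repr):      # (B, d_text) pooled text state
        return self.net(text_repr)     # (B, d_img) pseudo image features

halluc = FeatureHallucinator()
text_repr = torch.randn(8, 256)        # from a text encoder
real_img_feats = torch.randn(8, 2048)  # from a frozen image encoder

# Distillation term: push hallucinated features toward the real ones.
# In training this would be added to the usual translation loss; at
# inference the hallucinated features replace the image entirely.
distill_loss = F.mse_loss(halluc(text_repr), real_img_feats)
print(distill_loss.item())
```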
VALHALLA: Visual Hallucination for Machine Translation
In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are utilized to obtain the target translation.
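The hallucination step might be sketched like this: a small autoregressive decoder predicts discrete visual tokens (e.g. codebook indices from a pre-trained VQ model) conditioned on the source text encoding. The token count, sizes, greedy decoding, and BOS convention below are assumptions for illustration, not the VALHALLA implementation.

```python
import torch
import torch.nn as nn

class VisualHallucinator(nn.Module):
    """Illustrative autoregressive predictor of discrete visual tokens
    from a source-text encoding; all sizes are assumptions."""

    def __init__(self, codebook_size=1024, d_model=256, num_visual_tokens=16):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.tok_embed = nn.Embedding(codebook_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, codebook_size)

    @torch.no_grad()
    def forward(self, text_memory):                   # (B, T, d) text encoding
        B = text_memory.size(0)
        tokens = torch.zeros(B, 1, dtype=torch.long)  # BOS index 0 (assumed)
        # Greedy decoding: append one visual token at a time.
        for _ in range(self.num_visual_tokens):
            h = self.decoder(self.tok_embed(tokens), text_memory)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        # These tokens would be embedded and fused with the text
        # representation before decoding the target translation.
        return tokens[:, 1:]

visual_tokens = VisualHallucinator()(torch.randn(2, 12, 256))
print(visual_tokens.shape)  # torch.Size([2, 16])
```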