1 code implementation • LREC 2022 • Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia
The dataset will benefit research on visual grounding of words, especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.
Ranked #1 on Multimodal Text Prediction on MultiSubs
no code implementations • EMNLP (IWSLT) 2019 • Zixiu Wu, Ozan Caglayan, Julia Ive, Josiah Wang, Lucia Specia
In extensive experiments, we found that (i) the explored visual integration schemes often harm translation performance for the transformer and additive deliberation models, but considerably improve cascade deliberation; and (ii) the transformer and cascade deliberation integrate the visual modality better than additive deliberation, as shown by the incongruence analysis.
no code implementations • 16 Oct 2019 • Ozan Caglayan, Zixiu Wu, Pranava Madhyastha, Josiah Wang, Lucia Specia
This paper describes the Imperial College London team's submission to the 2019 VATEX video captioning challenge, where we first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features.
1 code implementation • ICCV 2019 • Josiah Wang, Lucia Specia
Localizing phrases in images is an important part of image understanding and can be useful in many applications that require mappings between textual and visual information.
no code implementations • 5 Aug 2019 • Zixiu Wu, Julia Ive, Josiah Wang, Pranava Madhyastha, Lucia Specia
The question we ask ourselves is whether visual features can support the translation process. In particular, since this dataset is extracted from videos, we focus on the translation of actions, which we believe are poorly captured in the static image-text datasets currently used for multimodal translation.
no code implementations • ACL 2019 • Pranava Madhyastha, Josiah Wang, Lucia Specia
It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description.
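The abstract does not spell out the metric, but the idea of scoring faithfulness via semantic similarity between detected object labels and caption words can be sketched as follows. The word vectors, names, and aggregation rule below are illustrative assumptions, not the authors' actual implementation (which would use pretrained embeddings and real object detections):

```python
from math import sqrt

# Toy word vectors standing in for pretrained embeddings (illustrative only).
VECS = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.85, 0.2, 0.05],
    "car":   [0.0, 0.9, 0.1],
    "grass": [0.1, 0.0, 0.9],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def faithfulness(object_labels, caption_words):
    """Mean, over caption words, of each word's best similarity to any
    object label detected in the image."""
    scores = []
    for w in caption_words:
        if w not in VECS:
            continue  # skip out-of-vocabulary words in this toy setup
        scores.append(max(cos(VECS[w], VECS[o])
                          for o in object_labels if o in VECS))
    return sum(scores) / len(scores) if scores else 0.0

# A caption mentioning a puppy on grass scores higher against a
# dog-and-grass image than one mentioning a car.
faithful = faithfulness(["dog", "grass"], ["puppy", "grass"])
unfaithful = faithfulness(["dog", "grass"], ["car"])
```

A real system would also need to handle stop words and multi-word labels; this sketch only shows the similarity-aggregation core.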
1 code implementation • WS 2018 • Pranava Swaroop Madhyastha, Josiah Wang, Lucia Specia
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space.
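The distributional-similarity hypothesis can be caricatured as pure nearest-neighbour retrieval: caption a test image with the caption of its closest training image in feature space. The features and captions below are invented for illustration; the systems studied in the paper are neural generators, not literal retrieval systems:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical training set of (image feature vector, caption) pairs.
TRAIN = [
    ([1.0, 0.0, 0.1], "a dog running on grass"),
    ([0.0, 1.0, 0.2], "a red car on the road"),
]

def retrieve_caption(test_feature):
    """Caption a test image by copying the caption of its nearest
    training image in the shared feature space."""
    return max(TRAIN, key=lambda p: cosine(p[0], test_feature))[1]

retrieve_caption([0.9, 0.1, 0.1])  # → "a dog running on grass"
```

If a neural captioner's outputs are hard to distinguish from such a retrieval baseline, that supports the distributional-similarity reading of its behaviour.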
no code implementations • 11 Sep 2018 • Pranava Madhyastha, Josiah Wang, Lucia Specia
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn `distributional similarity' in a multimodal feature space by mapping a test image to similar training images in this space and generating a caption from the same space.
1 code implementation • NAACL 2018 • Pranava Madhyastha, Josiah Wang, Lucia Specia
We address the task of detecting foiled image captions, i.e., identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described.
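One intuition behind foil detection is that the replaced word, while semantically similar to the original, is less supported by what is actually visible in the image. The toy embeddings and the min-over-max rule below are illustrative assumptions, not the paper's model:

```python
from math import sqrt

# Toy word vectors standing in for pretrained embeddings (illustrative only).
VECS = {
    "dog":   [0.9, 0.1, 0.0],
    "cat":   [0.8, 0.3, 0.0],
    "grass": [0.1, 0.0, 0.9],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def likely_foil(object_labels, caption_words):
    """Flag the caption word least supported by any object label
    detected in the image as the candidate foiled word."""
    return min(caption_words,
               key=lambda w: max(cos(VECS[w], VECS[o]) for o in object_labels))

# The image shows a dog on grass; "dog" was foiled to "cat" in the caption.
likely_foil(["dog", "grass"], ["cat", "grass"])  # → "cat"
```

Note how subtle the task is: because foils are chosen to be semantically close ("cat" vs "dog"), the margin separating the foil from genuine caption words is small.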
no code implementations • NAACL 2018 • Josiah Wang, Pranava Madhyastha, Lucia Specia
The use of explicit object detectors as an intermediate step to image captioning - which used to constitute an essential stage in early work - is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding.
no code implementations • 9 Jan 2018 • Yu-Xing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandrea, Robert Gaizauskas, Liming Chen
This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations.
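The transfer step described above can be sketched very roughly in the spirit of classifier-to-detector adaptation approaches such as LSDA: learn how detector weights differ from classifier weights on the fully annotated categories, then apply that learned difference to categories with only image-level labels. The weight vectors and the simple mean-shift rule are illustrative assumptions, not the paper's exact method:

```python
# Hypothetical weight vectors for categories that have both image-level
# classifiers and box-level detectors (illustrative values).
clf = {"cat": [1.0, 0.2], "dog": [0.8, 0.4]}
det = {"cat": [0.7, 0.1], "dog": [0.5, 0.3]}

# Mean classifier-to-detector difference over the supervised categories.
dims = len(next(iter(clf.values())))
diff = [0.0] * dims
for c in clf:
    for i in range(dims):
        diff[i] += det[c][i] - clf[c][i]
diff = [d / len(clf) for d in diff]

def classifier_to_detector(w_clf):
    """Convert a classifier for a category without box annotations into
    an approximate detector by applying the learned mean shift."""
    return [w + d for w, d in zip(w_clf, diff)]

# "bird" has only an image-level classifier; derive a detector for it.
det_bird = classifier_to_detector([0.9, 0.3])
```

Real adaptation methods learn a richer transformation (and exploit category similarity), but the core idea of transferring the classifier/detector discrepancy is the same.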
1 code implementation • ICLR 2018 • Pranava Madhyastha, Josiah Wang, Lucia Specia
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn ‘distributional similarity’ in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space.
no code implementations • CVPR 2016 • Yu-Xing Tang, Josiah Wang, Boyang Gao, Emmanuel Dellandrea, Robert Gaizauskas, Liming Chen
This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations.
no code implementations • LREC 2016 • Josiah Wang, Robert Gaizauskas
The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems.