no code implementations • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro
By using visual speech units as the inputs to our system, we pre-train the model to predict the corresponding text outputs on massive multilingual data constructed by merging several VSR databases.
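A minimal sketch of this idea, assuming the visual speech units are discrete IDs fed to a standard encoder-decoder that predicts text tokens; the class name, vocabulary sizes, and architecture below are illustrative placeholders, not the paper's actual model:

```python
# Hypothetical sketch: discrete visual speech units -> text tokens.
import torch
import torch.nn as nn

NUM_UNITS, NUM_TEXT_TOKENS, DIM = 200, 1000, 256  # illustrative vocabulary sizes

class UnitToTextModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.unit_emb = nn.Embedding(NUM_UNITS, DIM)         # embed discrete visual speech units
        self.text_emb = nn.Embedding(NUM_TEXT_TOKENS, DIM)   # embed target text tokens
        self.seq2seq = nn.Transformer(d_model=DIM, batch_first=True)
        self.out = nn.Linear(DIM, NUM_TEXT_TOKENS)

    def forward(self, units, text_in):
        # units: (B, T_units) unit IDs, text_in: (B, T_text) shifted target tokens
        return self.out(self.seq2seq(self.unit_emb(units), self.text_emb(text_in)))

model = UnitToTextModel()
units = torch.randint(0, NUM_UNITS, (2, 50))
text_in = torch.randint(0, NUM_TEXT_TOKENS, (2, 20))
logits = model(units, text_in)                               # (2, 20, NUM_TEXT_TOKENS)
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_TEXT_TOKENS),
                                   torch.randint(0, NUM_TEXT_TOKENS, (2 * 20,)))
```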
1 code implementation • 5 Dec 2023 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with an audio-only A2A dataset.
no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
To this end, we begin by importing rich image-comprehension and language-modeling knowledge from a large-scale pre-trained vision-language model into Im2Sp.
no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
To mitigate this challenge, we learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units.
1 code implementation • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro
In doing so, rich speaker embedding information can be produced solely from the input visual information, so no extra audio information is needed at inference time.
no code implementations • 15 Aug 2023 • Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements.
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.
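A minimal sketch of this kind of face-conditioned synthesis, assuming the speaker embedding from one face image is simply broadcast over the text features before decoding to mel frames; all module names and dimensions are illustrative placeholders, not the paper's architecture:

```python
# Hypothetical sketch: a face-derived speaker embedding conditions text-to-speech decoding.
import torch
import torch.nn as nn

DIM = 256

face_encoder = nn.Sequential(          # stand-in for a real face encoder (e.g. a CNN)
    nn.Flatten(), nn.Linear(3 * 64 * 64, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
text_encoder = nn.Embedding(100, DIM)  # toy text/phoneme embedding
decoder = nn.GRU(2 * DIM, DIM, batch_first=True)
to_mel = nn.Linear(DIM, 80)            # project to an 80-bin mel-spectrogram frame

face = torch.randn(1, 3, 64, 64)                 # single target face image
phonemes = torch.randint(0, 100, (1, 30))        # toy phoneme sequence
spk = face_encoder(face)                         # (1, DIM) visual speaker embedding
txt = text_encoder(phonemes)                     # (1, 30, DIM)
dec_in = torch.cat([txt, spk.unsqueeze(1).expand(-1, txt.size(1), -1)], dim=-1)
mel, _ = decoder(dec_in)
mel = to_mel(mel)                                # (1, 30, 80) predicted mel frames
```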
1 code implementation • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro
Therefore, the proposed L2S model is trained to generate multiple targets: a mel-spectrogram and speech units.
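A minimal sketch of such a multi-target objective, assuming the model emits both a mel-spectrogram and speech-unit logits; the tensor shapes and loss weighting are illustrative, not the paper's settings:

```python
# Hypothetical joint loss over two targets: mel-spectrogram regression + unit classification.
import torch
import torch.nn.functional as F

B, T, N_MEL, N_UNITS = 2, 100, 80, 200
pred_mel = torch.randn(B, T, N_MEL, requires_grad=True)       # predicted mel-spectrogram
pred_unit_logits = torch.randn(B, T, N_UNITS, requires_grad=True)
gt_mel = torch.randn(B, T, N_MEL)                             # ground-truth mel-spectrogram
gt_units = torch.randint(0, N_UNITS, (B, T))                  # ground-truth speech units

mel_loss = F.l1_loss(pred_mel, gt_mel)                        # regression target
unit_loss = F.cross_entropy(pred_unit_logits.reshape(-1, N_UNITS), gt_units.reshape(-1))
loss = mel_loss + 1.0 * unit_loss                             # joint multi-target loss
loss.backward()
```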
no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.
1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, the audio and the visual inputs, compared to uni-modal models.
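A toy version of this kind of robustness check, comparing a model's outputs on clean versus corrupted audio/visual streams; the corruption functions and the stand-in model below are assumptions for illustration, not the paper's exact protocol:

```python
# Hypothetical robustness probe: corrupt audio (noise) and video (frame dropping), then compare.
import torch

def corrupt_audio(audio, noise_std=0.5):
    return audio + noise_std * torch.randn_like(audio)          # additive noise

def corrupt_video(video, drop_prob=0.5):
    mask = (torch.rand(video.shape[0], video.shape[1], 1, 1, 1) > drop_prob).float()
    return video * mask                                          # randomly drop frames

def evaluate(model, audio, video):
    with torch.no_grad():
        clean = model(audio, video)
        noisy = model(corrupt_audio(audio), corrupt_video(video))
    return clean, noisy  # compare e.g. word error rates of the two outputs

dummy_model = lambda a, v: a.mean() + v.mean()                   # stand-in for an AVSR model
audio = torch.randn(2, 16000)                                    # toy waveforms
video = torch.randn(2, 25, 3, 96, 96)                            # toy lip-region frames
clean_out, noisy_out = evaluate(dummy_model, audio, video)
```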
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
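A minimal sketch of such audio-keyed retrieval, assuming the memory is a fixed set of slots read by scaled dot-product attention; the slot count and feature size are illustrative:

```python
# Hypothetical key-value memory: audio-feature keys index stored lip-motion values.
import torch
import torch.nn.functional as F

M, DIM = 64, 256                                   # memory slots and feature size (illustrative)
key_memory = torch.randn(M, DIM)                   # audio-feature keys aligned with the values
value_memory = torch.randn(M, DIM)                 # stored lip-motion features

def retrieve(audio_feat):
    # audio_feat: (T, DIM) audio features from the input speech
    attn = F.softmax(audio_feat @ key_memory.t() / DIM ** 0.5, dim=-1)   # (T, M)
    return attn @ value_memory                     # (T, DIM) retrieved lip-motion features

lip_motion = retrieve(torch.randn(10, DIM))
```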