no code implementations • 23 Aug 2023 • Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro
We contribute a new large-scale 3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in identities, poses, and facial motions of 3D face meshes.
1 code implementation • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro
In doing so, rich speaker embedding information can be produced solely from the input visual information, so no extra audio information is necessary at inference time.
1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, i.e., the audio and visual inputs, compared to uni-modal models.
3 code implementations • 17 Feb 2023 • Minsu Kim, Joanna Hong, Yong Man Ro
To this end, we design a multi-task learning scheme that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss.
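The multi-task objective described above can be sketched as a weighted sum of the reconstruction loss and the auxiliary text and audio supervision terms. This is only an illustration of the general scheme; the function name, weights, and term structure are assumptions, not the paper's actual loss.

```python
def multitask_loss(recon_loss, text_loss, audio_loss, w_text=1.0, w_audio=1.0):
    """Combine the acoustic-feature reconstruction loss with auxiliary
    text and audio supervision terms (weights here are illustrative)."""
    return recon_loss + w_text * text_loss + w_audio * audio_loss

total = multitask_loss(1.0, 2.0, 3.0)  # 1.0 + 1.0*2.0 + 1.0*3.0 = 6.0
```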
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground-truth images in the value memory and aligns them with the corresponding audio features, so that they can be retrieved with an audio query at inference time.
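The audio-addressed retrieval described above can be sketched as dot-product attention over a key-value memory: stored audio features act as keys and the aligned lip-motion features as values. A minimal numpy illustration follows; all names, shapes, and the scoring function are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_lip_features(audio_query, audio_keys, lip_values):
    """Address the value memory with an audio query.

    audio_query: (d,)      audio feature at inference time
    audio_keys:  (M, d)    stored audio features (keys)
    lip_values:  (M, d_v)  aligned lip-motion features (values)
    """
    # Scaled dot-product scores against every stored key.
    scores = softmax(audio_keys @ audio_query / np.sqrt(audio_keys.shape[1]))
    # Weighted sum of stored lip features: the retrieved representation.
    return scores @ lip_values

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 16))    # M=8 memory slots, d=16
values = rng.standard_normal((8, 32))  # aligned lip features, d_v=32
query = keys[3]                        # a query near a stored key
out = retrieve_lip_features(query, keys, values)  # shape (32,)
```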
1 code implementation • 13 Jul 2022 • Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
The enhanced audio features are fused with the visual features and fed into an encoder-decoder model, composed of a Conformer encoder and a Transformer decoder, for speech recognition.
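The fusion step described above can be sketched as per-frame concatenation of the two feature streams followed by a linear projection to the encoder dimension. This is a minimal late-fusion sketch under assumed shapes; the actual fusion mechanism and dimensions in the paper may differ.

```python
import numpy as np

def fuse_av(audio_feats, visual_feats, w):
    """Concatenate per-frame audio and visual features and project them
    to the encoder width (simple concatenation-fusion sketch).

    audio_feats:  (T, d_a)  enhanced audio features
    visual_feats: (T, d_v)  visual features, frame-aligned with audio
    w:            (d_a + d_v, d_model)  projection matrix
    """
    fused = np.concatenate([audio_feats, visual_feats], axis=-1)  # (T, d_a + d_v)
    return fused @ w  # (T, d_model): input to the Conformer encoder

rng = np.random.default_rng(1)
audio = rng.standard_normal((50, 256))    # T=50 frames, d_a=256
visual = rng.standard_normal((50, 512))   # T=50 frames, d_v=512
proj = rng.standard_normal((256 + 512, 256)) * 0.01
fused = fuse_av(audio, visual, proj)      # shape (50, 256)
```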
no code implementations • 15 Jun 2022 • Joanna Hong, Minsu Kim, Yong Man Ro
Thus, the proposed framework has the advantage of synthesizing speech with the correct content even from a silent talking-face video of an unseen subject.
1 code implementation • NeurIPS 2021 • Minsu Kim, Joanna Hong, Yong Man Ro
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis.
1 code implementation • ICCV 2021 • Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro
By learning the interrelationship through the associative bridge, the proposed bridging framework can obtain the target-modal representations inside the memory network from the source-modal input alone, providing rich information for its downstream tasks.
Ranked #3 on Lipreading on CAS-VSR-W1k (LRW-1000)
1 code implementation • IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021 • Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro
Our key contributions are: (1) proposing the Visual Voice memory that brings rich information of audio that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen speaker training by memorizing auditory features and the corresponding visual features.
no code implementations • 16 Jul 2020 • Joanna Hong, Jung Uk Kim, Sangmin Lee, Yong Man Ro
Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units.