10 papers with code • 1 benchmark • 2 datasets
Talking face generation aims to synthesize a sequence of face images that correspond to given speech semantics.
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic.
While speech content can be captured by learning the intrinsic synchronization between the audio and visual modalities, we identify that a pose code can be complementarily learned within a modulated convolution-based reconstruction framework.
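The "modulated convolution-based reconstruction framework" mentioned here refers to weight-modulated convolutions of the kind popularized by StyleGAN2, where a per-sample latent code (such as a pose code) rescales the convolution weights. The sketch below is a minimal, generic illustration of that mechanism, not the paper's exact implementation; the class name, dimensions, and the linear mapping from the code to per-channel scales are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style weight-modulated convolution (illustrative sketch)."""

    def __init__(self, in_ch, out_ch, kernel_size, style_dim, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.to_scale = nn.Linear(style_dim, in_ch)  # code -> per-input-channel scales
        self.padding = kernel_size // 2
        self.eps = eps

    def forward(self, x, code):
        b, c, h, w = x.shape
        # Modulate: scale the shared weights per sample using the pose/style code.
        scale = self.to_scale(code).view(b, 1, c, 1, 1)            # (B, 1, Cin, 1, 1)
        weight = self.weight.unsqueeze(0) * scale                   # (B, Cout, Cin, k, k)
        # Demodulate: renormalize so output activations keep roughly unit variance.
        demod = torch.rsqrt(weight.pow(2).sum(dim=(2, 3, 4), keepdim=True) + self.eps)
        weight = weight * demod
        # Grouped-convolution trick: fold the batch dimension into conv groups.
        weight = weight.view(-1, c, *self.weight.shape[2:])          # (B*Cout, Cin, k, k)
        out = F.conv2d(x.view(1, b * c, h, w), weight, padding=self.padding, groups=b)
        return out.view(b, -1, h, w)
```

In such a reconstruction framework, the convolution weights themselves carry the pose information, so the same decoder can re-render a face under different poses by swapping the code.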
Indeed, merely being able to generate a single talking face would make a system seem almost robotic.
However, existing approaches fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio.
Ranked #1 on Unconstrained Lip-synchronization on LRW
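Out-of-sync artifacts of this kind are commonly quantified by embedding a short audio window and the corresponding mouth-region frames into a shared space and comparing them with cosine similarity, the idea behind SyncNet-style lip-sync experts. The snippet below is a minimal sketch of that scoring step, assuming pretrained `audio_encoder` and `video_encoder` modules; all names and tensor shapes are illustrative assumptions, not a specific paper's API.

```python
import torch
import torch.nn.functional as F

def lip_sync_score(audio_encoder, video_encoder, mel_window, frame_window):
    """Cosine-similarity sync score between an audio chunk and mouth crops.

    mel_window:   (B, 1, 80, T) mel-spectrogram slice around a timestep (assumed shape)
    frame_window: (B, 3 * N, H, W) N consecutive mouth-region frames stacked on channels
    Returns a per-sample score in [-1, 1]; higher means better audio-visual sync.
    """
    with torch.no_grad():
        a = F.normalize(audio_encoder(mel_window), dim=-1)    # (B, D) unit-norm audio embedding
        v = F.normalize(video_encoder(frame_window), dim=-1)  # (B, D) unit-norm video embedding
    return (a * v).sum(dim=-1)                                # cosine similarity
```

Averaging this score over sliding windows of a generated clip gives a rough measure of how much of the video is in sync with the new audio.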
We devise a cascade GAN approach to generate talking face videos that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions.
Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate a talking face video with accurate lip synchronization while maintaining smooth transitions in both lip and facial movement over the entire video clip.
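These last two excerpts describe pipelines that condition frame synthesis on audio while keeping motion temporally coherent. A common, generic way to realize this is a cascade: predict a per-frame motion representation (e.g., facial landmarks) from audio, smooth it over time, then render each frame from the reference image and the smoothed motion. The sketch below illustrates that pattern with hypothetical `audio2landmarks` and `renderer` modules and simple exponential smoothing; it is a generic illustration under these assumptions, not any single paper's method.

```python
import torch

def generate_talking_face(audio_feats, ref_image, audio2landmarks, renderer, alpha=0.6):
    """Cascade-style generation: audio -> landmarks -> frames (illustrative sketch).

    audio_feats: (T, D_a) per-frame audio features
    ref_image:   (3, H, W) reference/identity face image
    audio2landmarks, renderer: hypothetical pretrained modules
    alpha: exponential-smoothing factor controlling how smooth transitions are
    """
    frames, prev = [], None
    for t in range(audio_feats.shape[0]):
        lm = audio2landmarks(audio_feats[t])            # predicted landmarks for frame t
        if prev is not None:                            # smooth lip/facial motion across frames
            lm = alpha * lm + (1.0 - alpha) * prev
        prev = lm
        frames.append(renderer(ref_image, lm))          # render frame conditioned on landmarks
    return torch.stack(frames)                          # (T, 3, H, W) synthesized video
```

The smoothing factor trades off lip-sync sharpness against temporal stability: values near 1 follow the audio closely, while smaller values suppress jitter between frames.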