10 papers with code • 1 benchmark • 2 datasets
Talking face generation aims to synthesize a sequence of face images that correspond to given speech semantics.
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic.
While speech content can be captured by learning the intrinsic synchronization between the audio and visual modalities, we identify that a pose code can be complementarily learned within a modulated convolution-based reconstruction framework.
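The "modulated convolution-based reconstruction framework" mentioned here refers to weight-modulated convolutions of the kind popularized by StyleGAN2, where a per-sample latent code (such as a pose code) rescales the convolution weights. The sketch below is a minimal, generic illustration of that mechanism, not the paper's exact implementation; the class name, dimensions, and the linear mapping from the code to per-channel scales are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style weight-modulated convolution (illustrative sketch)."""

    def __init__(self, in_ch, out_ch, kernel_size, style_dim, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.to_scale = nn.Linear(style_dim, in_ch)  # code -> per-input-channel scales
        self.padding = kernel_size // 2
        self.eps = eps

    def forward(self, x, code):
        b, c, h, w = x.shape
        # Modulate: scale the shared weights per sample using the pose/style code.
        scale = self.to_scale(code).view(b, 1, c, 1, 1)            # (B, 1, Cin, 1, 1)
        weight = self.weight.unsqueeze(0) * scale                   # (B, Cout, Cin, k, k)
        # Demodulate: renormalize so output activations keep roughly unit variance.
        demod = torch.rsqrt(weight.pow(2).sum(dim=(2, 3, 4), keepdim=True) + self.eps)
        weight = weight * demod
        # Grouped-convolution trick: fold the batch dimension into conv groups.
        weight = weight.view(-1, c, *self.weight.shape[2:])          # (B*Cout, Cin, k, k)
        out = F.conv2d(x.view(1, b * c, h, w), weight, padding=self.padding, groups=b)
        return out.view(b, -1, h, w)
```

In such a reconstruction framework, the convolution weights themselves carry the pose information, so the same decoder can re-render a face under different poses by swapping the code.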
Indeed, merely being able to generate a single talking face would make a system seem almost robotic.
However, existing approaches fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio.
Ranked #1 on Unconstrained Lip-synchronization on LRW
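Out-of-sync artifacts of this kind are commonly quantified by embedding a short audio window and the corresponding mouth-region frames into a shared space and comparing them with cosine similarity, the idea behind SyncNet-style lip-sync experts. The snippet below is a minimal sketch of that scoring step, assuming pretrained `audio_encoder` and `video_encoder` modules; all names and tensor shapes are illustrative assumptions, not a specific paper's API.

```python
import torch
import torch.nn.functional as F

def lip_sync_score(audio_encoder, video_encoder, mel_window, frame_window):
    """Cosine-similarity sync score between an audio chunk and mouth crops.

    mel_window:   (B, 1, 80, T) mel-spectrogram slice around a timestep (assumed shape)
    frame_window: (B, 3 * N, H, W) N consecutive mouth-region frames stacked on channels
    Returns a per-sample score in [-1, 1]; higher means better audio-visual sync.
    """
    with torch.no_grad():
        a = F.normalize(audio_encoder(mel_window), dim=-1)    # (B, D) unit-norm audio embedding
        v = F.normalize(video_encoder(frame_window), dim=-1)  # (B, D) unit-norm video embedding
    return (a * v).sum(dim=-1)                                # cosine similarity
```

Averaging this score over sliding windows of a generated clip gives a rough measure of how much of the video is in sync with the new audio.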
We devise a cascade GAN approach to generate talking face videos that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions.
Given an arbitrary face image and an arbitrary speech clip, the proposed work attempts to generate a talking face video with accurate lip synchronization while maintaining smooth transitions in both lip and facial movement over the entire video clip.
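These last two excerpts describe pipelines that condition frame synthesis on audio while keeping motion temporally coherent. A common, generic way to realize this is a cascade: predict a per-frame motion representation (e.g., facial landmarks) from audio, smooth it over time, then render each frame from the reference image and the smoothed motion. The sketch below illustrates that pattern with hypothetical `audio2landmarks` and `renderer` modules and simple exponential smoothing; it is a generic illustration under these assumptions, not any single paper's method.

```python
import torch

def generate_talking_face(audio_feats, ref_image, audio2landmarks, renderer, alpha=0.6):
    """Cascade-style generation: audio -> landmarks -> frames (illustrative sketch).

    audio_feats: (T, D_a) per-frame audio features
    ref_image:   (3, H, W) reference/identity face image
    audio2landmarks, renderer: hypothetical pretrained modules
    alpha: exponential-smoothing factor controlling how smooth transitions are
    """
    frames, prev = [], None
    for t in range(audio_feats.shape[0]):
        lm = audio2landmarks(audio_feats[t])            # predicted landmarks for frame t
        if prev is not None:                            # smooth lip/facial motion across frames
            lm = alpha * lm + (1.0 - alpha) * prev
        prev = lm
        frames.append(renderer(ref_image, lm))          # render frame conditioned on landmarks
    return torch.stack(frames)                          # (T, 3, H, W) synthesized video
```

The smoothing factor trades off lip-sync sharpness against temporal stability: values near 1 follow the audio closely, while smaller values suppress jitter between frames.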