What question did this study set out to answer?

The aim is to develop a speech-driven talking-face synthesizer that mimics realistic facial dynamics and expressions.

April 8, 2026

Precise Speech-driven Talking-Face Synthesis with Realistic Speaker-Emulated Facial Expressions

Key Points

Developed a two-stage framework for face animation synthesis.
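First stage pairs a dynamic fused-features generation module (DFGM), built from cross-modal speech-facial fusion (CSFF) and a temporal convolutional network (TCN), with an adaptive identity extractor (AIE) that reads a static reference image.
Second stage implements a diffusion-based rendering model (DIRM) for high-resolution video output.
Built the AIE from cascaded residual blocks with partial batch normalization (PBN-ResNet) and squeeze-and-excitation (SE-ResNet) units.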
Achieved significant improvements in lip-synchronization and upper facial expression dynamics.
Maintained facial identity consistency across different emotions and speaking speeds.
Outperformed the StyleTalk model with lower FID, LSE-D, and LME metrics, indicating enhanced quality.

Abstract

Traditional facial animation models place little emphasis on naturally imitating dynamic facial expressions, including eye movements, dynamic eyebrow deformation, and head motion. Hence, this paper proposes a speech-driven talking-face synthesizer (SDTS) that generates a dynamic talking video from a given static face, semantically mimicking the speech of any real person. The SDTS drives the static digital-twin face to vividly reproduce the expressive facial motions and lip-synced mouth movements of various speakers, each with a personalized and highly distinctive accent. The SDTS framework has two stages. In the first stage, one branch, termed the dynamic fused-features generation module (DFGM), contains a cross-modal speech-facial fusion module (CSFF) and a temporal convolutional network (TCN). The CSFF is the core component that seamlessly aligns the speech features with the facial features. The second branch is the self-designed adaptive identity extractor (AIE), in which a series of residual blocks with a partial batch normalization unit (PBN-ResNet blocks) and residual blocks with a squeeze-and-excitation unit (SE-ResNet blocks) are cascaded to precisely capture the key facial features in a static reference image. In the second stage, a diffusion model, termed the diffusion-based rendering model (DIRM), generates a high-resolution video reconstruction with consistent appearance and emotion by fusing the driving speech features with the reference facial features. Extensive experiments demonstrate that SDTS significantly improves lip synchronization, enriches upper-face expressions, and produces natural head movements. Moreover, SDTS steadily maintains facial identity consistency and facial expression coherence across varying speaking speeds and emotions. It attains FID, LSE-D, and LME scores that are lower than those of StyleTalk, a well-known talking-face synthesis model, by 5.26, 0.72, and 0.56, respectively.
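
The abstract names squeeze-and-excitation (SE) units inside the identity extractor but does not give their layout. As a rough illustration only, the following pure-Python sketch shows the generic SE idea: pool each channel to one number, pass the result through a small bottleneck, and use the resulting gates to rescale the channels. The function names, the bias-free weight matrices `w_reduce` and `w_expand`, and the flattened list-of-lists feature layout are assumptions made for this sketch and are not taken from the paper. The convolutional layers that would normally precede the gate in a residual branch are omitted.

```python
import math


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Rescale each channel of feature_map by a gate in (0, 1).

    feature_map: list of C channels, each a non-empty list of floats
                 (a flattened H*W map; layout assumed for this sketch).
    w_reduce:    R x C weight matrix (hypothetical, bias-free) for the bottleneck.
    w_expand:    C x R weight matrix (hypothetical, bias-free) restoring C gates.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_map]

    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]

    # Excitation, step 2: project back to C values and apply a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]

    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, feature_map)]
    return scaled, gates


def se_residual(feature_map, w_reduce, w_expand):
    """Add the gated features back onto the input (identity skip connection)."""
    scaled, _ = squeeze_excite(feature_map, w_reduce, w_expand)
    return [
        [v + s for v, s in zip(channel, scaled_channel)]
        for channel, scaled_channel in zip(feature_map, scaled)
    ]
```

With all weights set to zero, every gate is sigmoid(0) = 0.5, so each channel is halved before the skip connection adds the input back. In a trained network the gates would differ per channel, letting the block emphasize informative channels and suppress others.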
