Traditional facial animation models pay little attention to the naturalness of dynamic facial expressions, including eye movements, eyebrow deformation, and head motion. Hence, in this paper, a speech-driven talking-face synthesizer (SDTS) is proposed that generates a dynamic talking video from a given static face so that it semantically mimics the speech of any real person. SDTS drives a static digital-twin face to vividly reproduce the expressive facial motions and lip-synced mouth movements of various speakers, preserving each speaker's personalized accent with high distinctiveness. The SDTS framework has two stages. The first stage has two branches. The first branch, termed the dynamic fused-features generation module (DFGM), contains a cross-modal speech-facial fusion module (CSFF) and a temporal convolutional network (TCN); the CSFF is the core component that aligns the speech features with the facial features. The second branch is a self-designed adaptive identity extractor (AIE), in which residual blocks with a partial batch normalization unit (PBN-ResNet blocks) and residual blocks with a squeeze-and-excitation unit (SE-ResNet blocks) are cascaded to capture the key facial features of a static reference image. In the second stage, a diffusion model termed the diffusion-based rendering model (DIRM) reconstructs high-resolution video with consistent appearance and emotion by fusing the driving speech features with the reference facial features. Extensive experiments demonstrate that SDTS significantly improves lip synchronization, enriches upper-face expressions, and produces natural head movements. Moreover, SDTS maintains facial identity consistency and facial expression coherence across varying speaking speeds and emotions. Compared with StyleTalk, a well-known talking-face synthesis model, SDTS lowers FID by 5.26, LSE-D by 0.72, and LME by 0.56.
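To make the squeeze-and-excitation unit named in the AIE branch more concrete, the following is a minimal, framework-free sketch of channel gating inside a residual connection. It is an illustration under stated assumptions, not the SDTS implementation: the function names `squeeze_excite` and `se_residual_block`, the weight shapes, and the omission of the convolutional layers that a real SE-ResNet block would apply before gating are all hypothetical simplifications.

```python
import math


def squeeze_excite(feature_map, w1, w2):
    """Apply squeeze-and-excitation gating to a feature map.

    feature_map: list of C channels, each an H x W list of lists of floats.
    w1: (C // r) x C weight matrix of the reduction layer (hypothetical shape).
    w2: C x (C // r) weight matrix of the expansion layer (hypothetical shape).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling collapses each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: fully connected reduction followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w1
    ]
    # Excitation, step 2: fully connected expansion followed by a sigmoid,
    # giving one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


def se_residual_block(feature_map, w1, w2):
    """Add the gated features back onto the input (identity shortcut).

    Assumption: the convolutional transform of a real residual branch is
    replaced by the identity here to keep the sketch short.
    """
    scaled, gates = squeeze_excite(feature_map, w1, w2)
    output = [
        [[x + y for x, y in zip(row_in, row_scaled)]
         for row_in, row_scaled in zip(channel_in, channel_scaled)]
        for channel_in, channel_scaled in zip(feature_map, scaled)
    ]
    return output, gates
```

Calling `se_residual_block` on a two-channel 2x2 feature map with a 1x2 matrix `w1` and a 2x1 matrix `w2` returns the gated-plus-identity output together with the two channel gates. In a trained network the gates would let the extractor emphasize the channels that carry identity cues from the reference image.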
Wang et al. (Fri,) studied this question.