June 28, 2021

Multimodal Learning for Temporally Coherent Talking Face Generation With Articulator Synergy

Abstract

Talking face generation is the demanding task of synthesizing a high-quality video with accurate lip synchronization and rhythmic head motion. Humans are sensitive to even subtle artifacts, which degrade the perceived visual quality. Existing methods tend to use conditional generation, introducing facial landmarks to bridge the input information and the output video. However, these methods suffer from unrealistic facial animations because 1) they take only single-modality input and ignore the complementarity of multimodal inputs for improving lip-sync; 2) they model only lip movements and ignore the articulator synergy between lips and jaw; and 3) they generate each video frame independently and ignore temporal continuity across the video. To address these limitations, we present a novel method that generates realistic and temporally coherent talking heads by considering multimodal inputs, articulator synergy, inter-frame consistency, and intra-frame consistency. First, for landmark prediction, we propose a Multiple Synergy Network (MSN) that improves prediction accuracy by incorporating multimodal inputs (i.e., audio and text). Rather than considering lip landmarks alone, it also models jaw movements to ensure articulator synergy between lips and jaw. Second, for realistic video generation, we propose a Video Consistency Network (VCN) conditioned on the predicted landmarks. In VCN, optical flow models the temporal continuity between frames to ensure inter-frame consistency. Meanwhile, a mouth generation branch enhances mouth texture, and the corresponding mouth mask ensures intra-frame consistency between the mouth area and the rest of the frame. Extensive experiments demonstrate that our approach achieves superior lip-sync and generates photo-realistic facial animations.
The project is available at http://imcc.ustc.edu.cn/project/tfgen/.
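
The abstract does not state how the mouth mask combines the mouth generation branch with the rest of the frame. As a rough illustration only, one common way to apply such a mask is per-pixel alpha blending; the sketch below assumes that approach. The function name `composite_with_mouth_mask`, the grayscale nested-list frames, and the sample values are all hypothetical and are not taken from the paper.

```python
def composite_with_mouth_mask(face, mouth, mask):
    """Blend a mouth-branch output into a full frame (illustrative assumption).

    face, mouth: 2-D lists of grayscale pixel values with the same shape.
    mask: 2-D list of weights in [0, 1]; 1 keeps the mouth-branch pixel,
          0 keeps the full-frame pixel.
    """
    if not (len(face) == len(mouth) == len(mask)):
        raise ValueError("face, mouth and mask must have the same height")

    blended = []
    for face_row, mouth_row, mask_row in zip(face, mouth, mask):
        if not (len(face_row) == len(mouth_row) == len(mask_row)):
            raise ValueError("face, mouth and mask must have the same width")
        # Per-pixel alpha blend: mask * mouth + (1 - mask) * face.
        blended.append([
            m * mo + (1.0 - m) * f
            for f, mo, m in zip(face_row, mouth_row, mask_row)
        ])
    return blended


# Hypothetical 2x2 frame: the bottom row is the (soft-edged) mouth region.
face = [[0.2, 0.2], [0.2, 0.2]]
mouth = [[0.8, 0.8], [0.8, 0.8]]
mask = [[0.0, 0.0], [0.5, 1.0]]
frame = composite_with_mouth_mask(face, mouth, mask)
```

A soft mask with fractional weights near the boundary, as in the bottom-left pixel here, avoids a visible seam between the mouth region and the surrounding pixels, which is the kind of intra-frame consistency the abstract refers to.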
