What question did this study set out to answer?

The research aims to improve lip synchronization and visual coherence in talking face generation using neural radiance fields.

February 11, 2026 (Open Access)

Enhancing Audio–Visual Synchronization and Spatiotemporal Expressiveness for Talking Face Generation

Tao Wen (Shanghai University), Hengjie Lu (Shanghai University), Yuan Gao (Shanghai University)

Key Points

Developed a NeRF-based model to enhance audio-visual synchronization.
Incorporated audio event features to reduce background noise impact.
Implemented a feedback mechanism for stabilizing lip movements across frames.
Introduced facial depth supervision to improve training efficiency and spatial consistency.
Achieved state-of-the-art lip synchronization accuracy compared to existing methods.
Demonstrated improved spatiotemporal stability in lip movements.
Enhanced overall visual fidelity of generated talking faces.

Abstract

Talking face generation aims to produce high-fidelity, temporally coherent videos of speakers with synchronized lip movements aligned to input audio. Neural radiance fields (NeRF) are widely adopted due to their realistic modeling capabilities. However, existing NeRF-based approaches face several challenges. First, background noise often disrupts lip synchronization, making it difficult to align lip movements accurately with audio signals, especially when training data are temporally constrained. Furthermore, these methods suffer from spatiotemporal inconsistency, which manifests in two ways: temporally, unreliable audio signals lead to flickering lip movements, undermining coherence; spatially, the lack of facial structure constraints reduces realism and hinders training efficiency. To address these issues, we propose a NeRF-based method that enhances audio–visual synchronization and SpatioTemporal expressiveness for talking face generation (AVIST). Specifically, we enhance the saliency of human speech in audio using audio event features, effectively suppressing background noise interference during training and inference to improve lip-sync accuracy. Additionally, we introduce a feedback mechanism that incorporates lip features from preceding frames to stabilize current lip movements, mitigating temporal instability. Finally, we integrate facial depth supervision to expedite network training and enhance spatial consistency, resulting in more realistic face rendering. Extensive experiments on mainstream datasets demonstrate that AVIST achieves state-of-the-art performance in lip synchronization, spatiotemporal stability, and overall visual fidelity.
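
The abstract does not give implementation details, so the following is only a rough sketch of two of the ideas it names, not the AVIST method itself. It assumes the feedback mechanism can be approximated by blending each frame's lip features with the previous stabilized frame (a plain exponential moving average, standing in for whatever learned module the paper uses), and that depth supervision can be approximated by a squared error between the expected ray depth from volume-rendering weights and a reference facial depth. All function names and the `feedback` weight are hypothetical.

```python
from typing import List, Optional, Sequence


def stabilize_lip_features(
    frames: Sequence[Sequence[float]], feedback: float = 0.5
) -> List[List[float]]:
    """Blend each frame's lip features with the previous stabilized frame.

    `feedback` is a hypothetical weight in [0, 1): 0 disables feedback,
    larger values lean harder on the preceding frame.
    """
    if not 0.0 <= feedback < 1.0:
        raise ValueError("feedback must be in [0, 1)")
    stabilized: List[List[float]] = []
    previous: Optional[List[float]] = None
    for current in frames:
        if previous is None:
            blended = list(current)  # first frame has no history to draw on
        else:
            blended = [
                feedback * p + (1.0 - feedback) * c
                for p, c in zip(previous, current)
            ]
        stabilized.append(blended)
        previous = blended  # feed the stabilized result forward
    return stabilized


def expected_depth(weights: Sequence[float], depths: Sequence[float]) -> float:
    """Expected ray depth: per-sample rendering weights times sample depths."""
    return sum(w * d for w, d in zip(weights, depths))


def depth_loss(
    weights: Sequence[float], depths: Sequence[float], reference_depth: float
) -> float:
    """Squared error between the rendered depth and a reference facial depth."""
    return (expected_depth(weights, depths) - reference_depth) ** 2
```

In this toy form, a jittery one-dimensional lip feature such as `[[0.0], [1.0], [0.0]]` is pulled toward its recent history, which is the intuition behind using preceding frames to reduce flicker. In a training setup, a depth term like `depth_loss` would typically be added to the usual photometric loss with some weighting, which is likewise an assumption here.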
