What type of study is this?

September 10, 2025Open Access

NeRF-LipSync: A Diffusion Model for Speech-Driven and View-Consistent Lip Synchronization in Digital Avatars

Key Points

NeRF-LipSync achieves synchronization accuracy of 9.06 in digital avatars, enhancing audio-visual synthesis.
On VoxCeleb2, NeRF-LipSync reaches a PSNR of 18.32, indicating strong perceptual realism in lip synchronization.
Methodology incorporates diffusion modeling and NeRF alignment, enhancing view consistency and identity preservation.
Results demonstrate generalization ability across speakers and highlight the importance of temporal attention in maintaining motion stability.

Abstract

Abstract. Achieving natural, accurate, and identity-preserving lip synchronization in talking avatars is a fundamental problem in audio-visual synthesis. Existing methods often struggle to generalize across speakers, maintain temporal smoothness, or preserve view consistency due to architectural limitations. In this paper, we present NeRF-LipSync, a novel generative framework that synthesizes lip movements conditioned on speech audio while maintaining temporal coherence and view-consistent appearance through a combination of diffusion-based modeling and NeRF-based spatial alignment. Our model incorporates temporal attention and leverages rich audio-visual embeddings to produce expressive, speaker-specific articulation. We evaluate NeRF-LipSync on the VoxCeleb2 and LRW datasets and compare it against strong baselines including Wav2Lip, PC-AVS, and Diff2Lip. On VoxCeleb2, our method achieves an FID of 2.75, SSIM of 0.56, PSNR of 18.32, and LMD of 3.01, with synchronization accuracy (Syncc) reaching 9.06. On LRW, it yields an FID of 2.40, SSIM of 0.71, PSNR of 21.03, and LMD of 2.16. These results confirm the strong generalization ability and perceptual realism of our approach. Ablation studies highlight the contribution of NeRF alignment to identity consistency, diffusion to visual expressiveness, and temporal attention to motion stability. NeRF-LipSync thus offers a robust, scalable solution for high-quality, speech-driven avatar animation.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Alexandr Axyonov

National Research University Higher School of Economics

Mikhail Dolgushin

Russian Academy of Sciences

Dmitry Ryumin

National Research University Higher School of Economics

Journals

The international archives of the photogrammetry, remote sensing and spatial information sciences/International archives of the photogrammetry, remote sensing and spatial information sciences

NeRF-LipSync: A Diffusion Model for Speech-Driven and View-Consistent Lip Synchronization in Digital Avatars

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study

Also consider

Also consider