What type of study is this?

This is an experimental study.

September 29, 2025 | Open Access

RAE-NeRF: Residual-Based Audio-Video Encoder with Denoising in Talking Head Synchronization

W. Pang (Sun Yat-sen University), Xiang Li (University of California, Los Angeles), Tao-Tao Tang (Zhongda Hospital, Southeast University)

Key Points

RAE-NeRF produced more realistic speech-driven facial synthesis than existing mainstream approaches, with higher visual fidelity.
In extensive experiments on multiple datasets, the framework outperformed existing methods in lip-sync accuracy and noise robustness.
The residual-based audio-visual encoder fused audio and visual features to drive facial expressions accurately.
The tri-plane hash encoder enabled efficient, high-quality 3D facial modeling and rendering, including in complex environments.

Abstract

In recent years, speech-driven facial synthesis has attracted significant attention due to its wide applications in virtual humans, remote conferencing, and digital human generation. However, existing methods still face limitations in terms of realism, synchronization, and robustness, primarily due to noise interference in speech signals and insufficient precision in audio-visual feature fusion. To address these challenges, this paper proposes an enhanced speech-driven facial synthesis framework: RAE-NeRF (Residual-based Audio-video Encoder with Neural Radiance Fields). The framework integrates three core modules: (1) the ZipEnhancer speech enhancement module, which extracts high-quality features from noisy speech; (2) a residual-based audio-visual encoder that effectively fuses audio and visual features to drive facial expressions accurately; and (3) a tri-plane hash encoder that achieves high-quality 3D facial modeling and rendering while maintaining efficiency. Extensive experiments conducted on multiple datasets demonstrate that RAE-NeRF significantly outperforms existing mainstream approaches in terms of realism, lip-sync accuracy, and noise robustness, validating the proposed framework’s effectiveness and superiority in complex environments for speech-driven facial synthesis.
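The abstract names a tri-plane hash encoder but gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a 3D point is projected onto the XY, YZ, and XZ planes, each plane stores its features in its own 2D hash table, and the three lookups are concatenated. The class and function names (`PlaneHashGrid`, `TriPlaneHashEncoder`, `hash_2d`), the single resolution, the table size, and the nearest-cell lookup are all illustrative assumptions; practical encoders typically use several resolutions, interpolate between cells, and learn the table entries.

```python
import random

# Per-axis multipliers in the style of Instant-NGP spatial hashing (assumed here).
PRIMES = (1, 2654435761)


def hash_2d(ix, iy, table_size):
    """Map integer grid coordinates to a slot in a fixed-size table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % table_size


class PlaneHashGrid:
    """One 2D feature plane stored as a hash table (single resolution)."""

    def __init__(self, resolution, table_size, feature_dim, seed):
        rng = random.Random(seed)
        self.resolution = resolution
        self.table_size = table_size
        # Small random values stand in for features that would be learned.
        self.table = [
            [rng.uniform(-1e-4, 1e-4) for _ in range(feature_dim)]
            for _ in range(table_size)
        ]

    def lookup(self, u, v):
        """Nearest-cell lookup for coordinates u, v assumed to lie in [0, 1]."""
        ix = min(int(u * self.resolution), self.resolution - 1)
        iy = min(int(v * self.resolution), self.resolution - 1)
        return self.table[hash_2d(ix, iy, self.table_size)]


class TriPlaneHashEncoder:
    """Encode a 3D point by concatenating features from three 2D planes."""

    def __init__(self, resolution=64, table_size=2**14, feature_dim=2):
        self.xy = PlaneHashGrid(resolution, table_size, feature_dim, seed=0)
        self.yz = PlaneHashGrid(resolution, table_size, feature_dim, seed=1)
        self.xz = PlaneHashGrid(resolution, table_size, feature_dim, seed=2)

    def encode(self, x, y, z):
        # List concatenation: the result has 3 * feature_dim entries.
        return self.xy.lookup(x, y) + self.yz.lookup(y, z) + self.xz.lookup(x, z)


encoder = TriPlaneHashEncoder()
features = encoder.encode(0.25, 0.5, 0.75)
print(len(features))  # 6 = 3 planes x 2 features per plane
```

Storing three 2D tables instead of one 3D grid is what makes this family of encoders compact: memory grows with the plane resolution squared rather than cubed, at the cost of representing the volume only through its three axis-aligned projections.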
