Speech-driven facial synthesis has attracted significant attention in recent years because of its wide application in virtual humans, remote conferencing, and digital human generation. However, existing methods remain limited in realism, synchronization, and robustness, primarily because of noise interference in speech signals and imprecise audio-visual feature fusion. To address these challenges, this paper proposes an enhanced speech-driven facial synthesis framework, RAE-NeRF (Residual-based Audio-video Encoder with Neural Radiance Fields). The framework integrates three core modules: (1) the ZipEnhancer speech enhancement module, which extracts high-quality features from noisy speech; (2) a residual-based audio-visual encoder, which fuses audio and visual features to drive facial expressions accurately; and (3) a tri-plane hash encoder, which achieves high-quality 3D facial modeling and rendering while maintaining efficiency. Extensive experiments on multiple datasets show that RAE-NeRF significantly outperforms existing mainstream approaches in realism, lip-sync accuracy, and noise robustness, supporting the framework's effectiveness for speech-driven facial synthesis in complex environments.
W. PANG
Sun Yat-sen University
Xiang Li
University of California, Los Angeles
Tao‐Tao Tang
Zhongda Hospital Southeast University
PANG et al. studied this question.
synapsesocial.com/papers/68da58e0c1728099cfd11615 — DOI: https://doi.org/10.20944/preprints202509.2231.v1