May 31, 2024 · Open Access

Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF

Ah-Hyung Shin (Kyung Hee University), Jae Ho Lee (Seoul National University), Jiwon Hwang (Theodore Roosevelt High School)

Abstract

Talking head generation is an essential task in various real-world applications such as filmmaking and virtual reality. Recent works therefore focus on NeRF-based methods, which can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that previous methods do not explicitly consider audio-visual representations, which are crucial for precise lip synchronization. Moreover, existing methods struggle to generate high-frequency details, which makes their results look unnatural. To overcome these problems, we propose a novel audio-synced and high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder so that the whole image can be generated quickly, which in turn allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and audio representations. In addition, we integrate the wavelet transform into our framework by proposing a wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and, on average, improves on the current state-of-the-art NeRF-based methods in four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
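
The abstract does not give the form of the wavelet loss, so the following is only a minimal sketch of the general idea, under stated assumptions: a one-level 2D Haar transform, an L1 distance per sub-band, and a hypothetical `high_freq_weight` parameter that up-weights the three detail bands. The function names and the weighting scheme are illustrative and are not taken from the paper.

```python
import numpy as np


def haar_dwt2(img):
    """One-level 2D orthonormal Haar transform of an (H, W) array.

    H and W must be even. Returns four sub-bands of shape (H/2, W/2):
    the low-frequency approximation followed by three detail bands.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0      # local average (low frequency)
    diff_cols = (a - b + c - d) / 2.0   # left column minus right column
    diff_rows = (a + b - c - d) / 2.0   # top row minus bottom row
    diff_diag = (a - b - c + d) / 2.0   # diagonal detail
    return approx, diff_cols, diff_rows, diff_diag


def wavelet_loss(pred, target, high_freq_weight=1.0):
    """Weighted sum of per-band mean absolute errors (assumed form).

    high_freq_weight is a hypothetical knob: values above 1 penalize
    errors in the detail bands more than errors in the approximation.
    """
    weights = (1.0, high_freq_weight, high_freq_weight, high_freq_weight)
    bands_pred = haar_dwt2(pred)
    bands_target = haar_dwt2(target)
    return sum(
        w * np.mean(np.abs(p - t))
        for w, p, t in zip(weights, bands_pred, bands_target)
    )
```

A blurry prediction that matches the target only in its local averages gets zero error in the approximation band, so the entire penalty comes from the detail bands. This is why such a loss can push a generator toward sharper high-frequency content than a plain pixel-wise loss would.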
