September 10, 2025 | Open Access

Research on Real-scene Video Face Restoration Methods Based on Time Consistency and Multimodal Fusion

Key Points

The method restores high-quality face video with improved temporal consistency by using both audio and visual inputs.
On the VoxCeleb2 dataset, it outperforms single-modal methods such as BasicVSR++ and VQF on PSNR, SSIM, and LPIPS.
A multi-stage framework extracts HOG features from video frames and MFCC features from audio, then uses both to guide a simplified 3D convolutional network that predicts dictionary indices for reconstruction.
Simplified spatio-temporal fidelity modules and optical flow smoothing improve spatio-temporal consistency.
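
As a point of reference for the first of the reported metrics, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and the restored frame. The sketch below is a minimal NumPy version, not the paper's evaluation code. The function names `psnr` and `video_psnr` are hypothetical, and averaging per-frame scores over a clip is an assumption about how a video-level score might be formed.

```python
import numpy as np


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    if reference.shape != restored.shape:
        raise ValueError("inputs must have the same shape")
    mse = np.mean((reference - restored) ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs have no error
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(reference_frames, restored_frames, max_value=255.0):
    """Mean of per-frame PSNR over a clip shaped (T, H, W, C).

    Assumption: the clip-level score is the average of frame-level scores.
    """
    scores = [
        psnr(ref, out, max_value)
        for ref, out in zip(reference_frames, restored_frames)
    ]
    return float(np.mean(scores))
```

With 8-bit frames, a uniform error of one gray level at every pixel gives an MSE of 1 and therefore a PSNR of 20 * log10(255), roughly 48.13 dB. SSIM and LPIPS need more machinery (local statistics and a learned network, respectively) and are not sketched here.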

Abstract

This paper proposes a simplified audio-guided video face restoration method. The goal is to recover high-quality, temporally consistent face videos. We designed a multi-stage framework that integrates audio and visual modalities through simple yet effective components. Specifically, we extract low-level HOG features from video frames and MFCC features from audio. We then utilize a simplified 3D convolutional network to predict dictionary indices guided by both modalities. A pre-trained TS-VQGAN decoder reconstructs high-quality frames. Simplified spatio-temporal fidelity modules and optical flow smoothing techniques are simultaneously applied to enhance spatio-temporal consistency. Experimental results on the VoxCeleb2 dataset demonstrate that our method outperforms single-modal methods such as BasicVSR++ and VQF in terms of PSNR, SSIM, and LPIPS metrics. This indicates that cross-modal fusion can still deliver consistent performance improvements in practical video restoration tasks even under a simplified structure.
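
To make the "low-level HOG features" step more concrete, here is a minimal sketch of a HOG-style descriptor in NumPy: per-cell histograms of gradient orientation, weighted by gradient magnitude. It is a simplification and not the authors' implementation. It omits the block normalization and bin interpolation of full HOG, and the function name `orientation_histogram`, the 8-pixel cell size, and the 9 orientation bins are all assumptions chosen for illustration.

```python
import numpy as np


def orientation_histogram(gray, cell=8, bins=9):
    """Simplified HOG-style features for one grayscale frame.

    Returns an array shaped (rows, cols, bins), one magnitude-weighted
    orientation histogram per cell. Assumed defaults: 8x8 cells, 9 bins.
    """
    gray = np.asarray(gray, dtype=np.float64)
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation folded into [0, pi).
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((angle / np.pi * bins).astype(int), bins - 1)

    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (
                slice(r * cell, (r + 1) * cell),
                slice(c * cell, (c + 1) * cell),
            )
            hist[r, c] = np.bincount(
                bin_index[window].ravel(),
                weights=magnitude[window].ravel(),
                minlength=bins,
            )
    return hist


# Usage: a horizontal intensity ramp has gradients pointing along x only,
# so all of the gradient energy falls into the first orientation bin.
ramp = np.tile(np.arange(16, dtype=np.float64), (16, 1))
features = orientation_histogram(ramp)
```

In a pipeline like the one the abstract describes, descriptors of this kind would be computed per frame and combined with per-frame audio features such as MFCCs before being passed to the index-prediction network. How the two are combined is not specified in the abstract, so it is not sketched here.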
