As audio–video synthesis technology spreads across multimedia domains, generating high-quality video that stays precisely synchronized with its audio has become a critical research challenge. Traditional audio–video synthesis approaches are weak at modeling temporal dependencies and at integrating multimodal features. To address these limitations, this study introduces the Audio–Visual Enhanced Discriminator Network (AuVi-EDNet), a framework that combines conditional Generative Adversarial Networks (GANs), Gated Recurrent Unit (GRU) networks, and multimodal fusion techniques for audio–video synchronization. The framework improves synchronization accuracy and naturalness by using GRU networks to extract audio and Mel-Frequency Cepstral Coefficient (MFCC) features and coupling them with spatiotemporal features from red–green–blue (RGB) image sequences. To further refine multimodal fusion, the model adds an audio–video synchronization discriminator and a global spatial attention mechanism, which together capture the temporal dependencies in both the audio and video streams. Experimental results show that AuVi-EDNet outperforms traditional methods on the Lip Reading in the Wild (LRW) and Lip Reading Sentences 3-TED (LRS3-TED) datasets, with higher accuracy and robustness in both audio–video synchronization and generation quality. The analysis built on this framework also offers insights for advancing audio–video synthesis technologies and research in multimodal digital-media evaluation.
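To make the GRU-based audio encoding step concrete, the following is a minimal sketch of a standard GRU cell folding a sequence of MFCC-like frames into one fixed-size vector. It is not the AuVi-EDNet implementation: the function names (`init_gru`, `gru_step`, `encode_sequence`), the 13-coefficient frame size, the hidden size, and the random untrained weights are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_gru(input_dim, hidden_dim, seed=0):
    # Hypothetical initializer: small random weights, zero biases (untrained).
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("z", "r", "n"):  # update gate, reset gate, candidate state
        params["W_" + gate] = mat(hidden_dim, input_dim)
        params["U_" + gate] = mat(hidden_dim, hidden_dim)
        params["b_" + gate] = [0.0] * hidden_dim
    return params


def gru_step(params, x, h):
    # One step of the standard GRU recurrence (not specific to the paper).
    wz, uz = matvec(params["W_z"], x), matvec(params["U_z"], h)
    wr, ur = matvec(params["W_r"], x), matvec(params["U_r"], h)
    wn, un = matvec(params["W_n"], x), matvec(params["U_n"], h)
    z = [sigmoid(a + b + c) for a, b, c in zip(wz, uz, params["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(wr, ur, params["b_r"])]
    n = [math.tanh(a + ri * b + c)
         for a, ri, b, c in zip(wn, r, un, params["b_n"])]
    # Blend the previous state and the candidate state via the update gate.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode_sequence(params, frames, hidden_dim):
    # Fold a sequence of feature frames into a single hidden vector.
    h = [0.0] * hidden_dim
    for x in frames:
        h = gru_step(params, x, h)
    return h


if __name__ == "__main__":
    # Synthetic stand-in for 5 frames of 13 MFCC coefficients (assumed size).
    frames = [[0.1 * (t + k) for k in range(13)] for t in range(5)]
    params = init_gru(input_dim=13, hidden_dim=4)
    print(encode_sequence(params, frames, hidden_dim=4))
```

In a full system, an audio embedding like this would presumably be paired with a video embedding and passed to the synchronization discriminator. That wiring is left out here because the abstract gives no details about it.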
Bei Zhang
PeerJ Computer Science
Bei Zhang studied this question.
www.synapsesocial.com/papers/69d1fde4a79560c99a0a4484 — DOI: https://doi.org/10.7717/peerj-cs.3753