What question did this study set out to answer?

Can probabilistic latent representations improve fundamental frequency (F0) control and noise robustness in neural vocoding?

May 14, 2026

Robust fundamental frequency control in source-filter neural vocoding via probabilistic latent representations

Key Points

To improve fundamental frequency control and noise robustness in neural vocoding using probabilistic latent representations.
Developed VAE-SiFiGAN that uses a variational autoencoder to learn latent representations from mel-spectrograms.
Guided latent representation learning with hand-crafted features, used as prior information during training, to reduce entanglement with F0 information and enhance controllability.
Compared VAE-SiFiGAN's performance against SiFi-GAN in experimental settings.
VAE-SiFiGAN shows superior fundamental frequency control compared to SiFi-GAN.
Learnable feature extraction improves robustness in noisy conditions.
Probabilistic features better capture the stochastic components of speech, improving perceptual quality during F0 manipulation.

Abstract

Source-filter HiFi-GAN (SiFi-GAN) offers fast and high-quality neural vocoding with fundamental frequency (F0) controllability. However, as in many F0-controllable vocoders, SiFi-GAN relies on hand-crafted acoustic features derived from traditional signal processing; this reliance can degrade sound quality under F0 extrapolation and in noisy conditions owing to mismatches between deterministic feature assumptions and the nature of real speech. We propose VAE-SiFiGAN, which learns probabilistic latent representations from mel-spectrograms via a variational autoencoder. These probabilistic features more effectively capture the stochastic components of speech, thereby improving perceptual quality during F0 manipulation. Moreover, learnable feature extraction improves robustness in noisy conditions. To address the limited F0 controllability caused by entanglement between mel-spectrograms and F0 information, we guide latent representation learning with hand-crafted features, which are less affected by F0, as prior information during training. Experimental results show that VAE-SiFiGAN achieves superior F0 controllability and noise robustness compared to SiFi-GAN. Work partly supported by JST AIP Acceleration Research JPMJCR25U5, Japan.
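
The abstract does not give the training objective, so the following is only a minimal sketch of one common way a variational autoencoder can be guided by prior information: a diagonal Gaussian posterior is sampled with the reparameterization trick and regularized by a KL term toward a Gaussian prior centered on a feature vector. Treating the hand-crafted features as the prior mean, along with the function names `reparameterize` and `kl_to_feature_prior` and the unit prior variance, are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps per dimension, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_feature_prior(mu, log_var, prior_mu, prior_log_var=0.0):
    """KL( N(mu, diag(exp(log_var))) || N(prior_mu, exp(prior_log_var) * I) ).

    prior_mu stands in for a hand-crafted feature frame (an assumption of
    this sketch, not the paper's stated formulation).
    """
    total = 0.0
    for m, lv, pm in zip(mu, log_var, prior_mu):
        var_ratio = math.exp(lv - prior_log_var)
        sq_diff = (m - pm) ** 2 / math.exp(prior_log_var)
        total += 0.5 * (prior_log_var - lv + var_ratio + sq_diff - 1.0)
    return total


# Hypothetical values: one 3-dimensional frame of hand-crafted features
# and an encoder posterior that matches it exactly.
rng = random.Random(0)
handcrafted = [0.2, -0.1, 0.4]
mu = [0.2, -0.1, 0.4]
log_var = [0.0, 0.0, 0.0]

z = reparameterize(mu, log_var, rng)
kl_match = kl_to_feature_prior(mu, log_var, handcrafted)
kl_shift = kl_to_feature_prior([1.2, -0.1, 0.4], log_var, handcrafted)
```

In this sketch the KL term is zero when the posterior coincides with the feature-centered prior and grows as the posterior mean drifts away from it, which is the sense in which such a prior could steer the latent representation toward features that are less affected by F0.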

Cite This Study

Ogita et al. studied this question.

synapsesocial.com/papers/6a0567bca550a87e60a1fdd7 https://doi.org/10.1121/10.0040623
