Source-filter HiFi-GAN (SiFi-GAN) offers fast, high-quality neural vocoding with fundamental frequency (F0) controllability. However, as in many F0-controllable vocoders, SiFi-GAN relies on hand-crafted acoustic features derived from traditional signal processing; this reliance can degrade sound quality under F0 extrapolation and in noisy conditions owing to mismatches between deterministic feature assumptions and the nature of real speech. We propose VAE-SiFiGAN, which learns probabilistic latent representations from mel-spectrograms via a variational autoencoder. These probabilistic features capture the stochastic components of speech more effectively, thereby improving perceptual quality during F0 manipulation. Moreover, learnable feature extraction improves robustness in noisy conditions. To address the limited F0 controllability caused by entanglement between mel-spectrograms and F0 information, we guide latent representation learning during training with hand-crafted features, which are less affected by F0, as prior information. Experimental results show that VAE-SiFiGAN achieves superior F0 controllability and noise robustness compared to SiFi-GAN. Work partly supported by JST AIP Acceleration Research JPMJCR25U5, Japan.
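The idea of guiding a learned latent with hand-crafted features as prior information can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the encoder posterior and the feature-derived prior are diagonal Gaussians; the abstract does not state the actual distributions or loss, and the function names `reparameterize` and `kl_diag_gaussians` are hypothetical, not taken from the authors' implementation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    Assumption: the encoder outputs a mean and a log-variance for each
    latent dimension (a standard VAE choice, not confirmed by the source).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    q plays the role of the encoder posterior computed from a mel-spectrogram
    frame; p plays the role of a prior whose parameters are derived from
    hand-crafted features (hypothetical setup for this sketch).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        var_q = math.exp(lq)
        var_p = math.exp(lp)
        total += 0.5 * (lp - lq + (var_q + (mq - mp) ** 2) / var_p - 1.0)
    return total
```

In a setup like this, the KL term would be added to the vocoder's training loss so that the latent stays close to the feature-derived prior, while sampling through `reparameterize` keeps the representation stochastic. The weighting of that term and the form of the prior are design choices the abstract leaves open.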
Ogita et al. studied this question.