This paper proposes a speech synthesis technique based on a neural sequence-to-sequence (Seq2Seq) model that incorporates the structure of hidden semi-Markov models (HSMMs). Although Seq2Seq models with attention mechanisms have achieved high-quality synthesis, they suffer from alignment instability and the absence of explicit duration modeling, making direct duration control difficult. To address these challenges, recent approaches have explored models that incorporate explicit alignment and duration representations instead of attention mechanisms. However, such methods still fall short of the fully consistent duration handling achieved by traditional HSMM-based synthesis. The proposed model is a theoretically well-grounded deep generative model that integrates an HSMM structure into a variational autoencoder (VAE). It performs a probabilistic full-space alignment search considering duration probabilities, and its training algorithm is derived purely from the maximization of the evidence lower bound (ELBO), without relying on heuristic assumptions or auxiliary criteria. A key contribution of this work is the identification of an essential two-stage approximation necessary for the proposed model: 1) a conjugate posterior distribution with an HSMM structure, and 2) a subsequent mean-field approximation for the VAE decoder. Furthermore, interpreting the proposed model as a Seq2Seq model with an HSMM-structured attention mechanism establishes a theoretical connection between attention mechanisms and explicit alignment modeling. Experiments on a Japanese speech database demonstrate that the proposed method achieves higher-quality synthesized speech than conventional neural network-based acoustic models, while maintaining high modeling efficiency even with limited training data.
Nankaku et al. studied this question.
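To make the phrase "probabilistic full-space alignment search considering duration probabilities" concrete, the following is a minimal sketch of the classical forward recursion for a left-to-right HSMM. It sums over every possible segmentation of the frames into states, weighting each by its duration probabilities. It is not the proposed model or its training algorithm: the function names, the table layout (`log_emit`, `log_dur`), and the assumptions that every state is visited exactly once in order and that emission log-probabilities are finite are all illustrative choices made here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def hsmm_forward_loglik(log_emit, log_dur):
    """Log-likelihood of T frames under a left-to-right HSMM (illustrative).

    log_emit[k][t]  : log-probability of frame t under state k (K x T, finite).
    log_dur[k][d-1] : log-probability that state k lasts d frames (K x D).
    Assumption: states are visited once each, in order, for >= 1 frame.
    """
    K, T = len(log_emit), len(log_emit[0])
    D = len(log_dur[0])
    # alpha[k][t]: log-probability that the first k states exactly cover
    # the first t frames, summed over all segmentations.
    alpha = [[-math.inf] * (T + 1) for _ in range(K + 1)]
    alpha[0][0] = 0.0
    for k in range(1, K + 1):
        # Prefix sums let us score any segment of state k-1 in O(1).
        cum = [0.0]
        for t in range(T):
            cum.append(cum[-1] + log_emit[k - 1][t])
        for t in range(1, T + 1):
            terms = []
            for d in range(1, min(D, t) + 1):
                prev = alpha[k - 1][t - d]
                if prev == -math.inf:
                    continue  # no valid segmentation ends at frame t - d
                segment = cum[t] - cum[t - d]
                terms.append(prev + log_dur[k - 1][d - 1] + segment)
            if terms:
                alpha[k][t] = logsumexp(terms)
    return alpha[K][T]
```

The recursion costs O(K · T · D) and returns negative infinity when no segmentation fits, for example when the frames outnumber the total duration the states can cover. A matching backward pass would give the state and duration posteriors that an HSMM-structured model can use in place of attention weights.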