April 25, 2026 · Open Access

Deep Hidden Semi-Markov Model-Based Speech Synthesis

Key Points

This research aims to improve speech synthesis quality by integrating hidden semi-Markov models into a deep generative framework.
The proposed model integrates a hidden semi-Markov model (HSMM) structure into a variational autoencoder (VAE).
It performs a probabilistic full-space alignment search that takes duration probabilities into account.
The training algorithm is derived purely from maximizing the evidence lower bound (ELBO), without heuristic assumptions or auxiliary criteria.
In experiments on a Japanese speech database, the proposed model produces higher-quality synthesized speech than conventional neural network-based acoustic models.
It maintains high modeling efficiency even with limited training data.

Abstract

This paper proposes a speech synthesis technique based on a neural sequence-to-sequence (Seq2Seq) model that incorporates the structure of hidden semi-Markov models (HSMMs). Although Seq2Seq models with attention mechanisms have achieved high-quality synthesis, they suffer from alignment instability and the absence of explicit duration modeling, making direct duration control difficult. To address these challenges, recent approaches have explored models that incorporate explicit alignment and duration representations instead of attention mechanisms. However, such methods still fall short of the fully consistent duration handling achieved by traditional HSMM-based synthesis. The proposed model is a theoretically well-grounded deep generative model that integrates an HSMM structure into a variational autoencoder (VAE). It performs a probabilistic full-space alignment search considering duration probabilities, and its training algorithm is derived purely from the maximization of the evidence lower bound (ELBO), without relying on heuristic assumptions or auxiliary criteria. A key contribution of this work is the identification of an essential two-stage approximation necessary for the proposed model: 1) a conjugate posterior distribution with an HSMM structure, and 2) a subsequent mean-field approximation for the VAE decoder. Furthermore, interpreting the proposed model as a Seq2Seq model with an HSMM-structured attention mechanism establishes a theoretical connection between attention mechanisms and explicit alignment modeling. Experiments on a Japanese speech database demonstrate that the proposed method achieves higher-quality synthesized speech than conventional neural network-based acoustic models, while maintaining high modeling efficiency even with limited training data.
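
To make the idea of a full-space alignment search with duration probabilities more concrete, the following is a minimal sketch of the classical HSMM forward recursion. It sums over every left-to-right segmentation of T frames into K states. It is not the paper's neural model: the function name `hsmm_log_likelihood` and the inputs `log_emit` and `log_dur` are hypothetical, and in a model like the one described the emission scores would presumably come from a learned decoder rather than a fixed table.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values, default=NEG_INF)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def hsmm_log_likelihood(log_emit, log_dur):
    """Log-probability summed over all left-to-right alignments.

    log_emit[k][t]: log-likelihood of frame t under state k (illustrative input).
    log_dur[k][d-1]: log-probability that state k lasts d frames, d = 1..D.
    """
    K, T = len(log_emit), len(log_emit[0])

    # cum[k][t] = sum of log_emit[k][0:t], so a segment score is a difference.
    cum = [[0.0] * (T + 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            cum[k][t + 1] = cum[k][t] + log_emit[k][t]

    # alpha[k][t]: log-probability that states 0..k exactly cover frames 0..t-1.
    alpha = [[NEG_INF] * (T + 1) for _ in range(K)]
    for k in range(K):
        for t in range(1, T + 1):
            terms = []
            for d in range(1, min(len(log_dur[k]), t) + 1):
                if k == 0:
                    # The first state must start at frame 0.
                    prev = 0.0 if t == d else NEG_INF
                else:
                    prev = alpha[k - 1][t - d]
                if prev == NEG_INF:
                    continue
                segment = cum[k][t] - cum[k][t - d]
                terms.append(prev + log_dur[k][d - 1] + segment)
            alpha[k][t] = logsumexp(terms)

    # Every state has been visited and every frame has been consumed.
    return alpha[K - 1][T]
```

Because each state carries an explicit duration distribution, the recursion scores whole segments rather than single frame-to-frame transitions, which is what distinguishes an HSMM from a plain HMM. The sketch costs O(K·T·D) time for a maximum duration D and assumes finite emission log-likelihoods.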
