We explore inference-time optimization strategies to improve text-to-speech (TTS) systems by leveraging automatic perceptual assessment of synthetic speech. Recent advances in non-autoregressive TTS based on parallel iterative decoding have demonstrated strong capabilities in prosodic diversity, duration controllability, and robustness. These models typically generate discrete speech tokens—extracted from speech self-supervised learning models or neural audio codec models—by iteratively unmasking them based on computed generation probabilities. However, selecting tokens to unmask solely based on generation probability does not always yield perceptually optimal speech quality. To address this limitation, we propose a novel decoding strategy that incorporates perceptual speech assessment into the token selection process, guiding selection toward tokens that optimize perceptual ratings at inference time. Specifically, we introduce a best-of-K sampling strategy, in which multiple candidate tokens are sampled and evaluated using a learned perceptual rating predictor; the tokens with the highest predicted rating are selected for unmasking. We investigate three variants of this strategy: (1) iterative application at each unmasking step, (2) one-shot application after the entire unmasking process, and (3) a hybrid approach combining both. Subjective evaluations demonstrate that applying our method to a state-of-the-art zero-shot TTS model improves both the naturalness and speaker similarity of synthetic speech.
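The best-of-K idea described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `MASK`, `unmask_step`, `best_of_k`, `decode`, and the `rate` callback are all hypothetical names, the unmasking step samples tokens uniformly instead of using a real model's generation probabilities, and it assumes the rating predictor can score partially unmasked sequences.

```python
import random
from typing import Callable, List

MASK = -1  # hypothetical id marking a still-masked position
Tokens = List[int]


def unmask_step(tokens: Tokens, n_unmask: int, vocab_size: int,
                rng: random.Random) -> Tokens:
    """Toy stand-in for one unmasking step: fill up to n_unmask masked slots.

    A real model would sample from its generation probabilities; here tokens
    are drawn uniformly, purely for illustration.
    """
    out = list(tokens)
    masked = [i for i, t in enumerate(out) if t == MASK]
    for i in rng.sample(masked, min(n_unmask, len(masked))):
        out[i] = rng.randrange(vocab_size)
    return out


def best_of_k(candidate_fn: Callable[[], Tokens],
              rate: Callable[[Tokens], float], k: int) -> Tokens:
    """Draw k candidates and keep the one with the highest predicted rating."""
    candidates = [candidate_fn() for _ in range(k)]
    return max(candidates, key=rate)


def decode(length: int, steps: int, k_iter: int, k_final: int,
           rate: Callable[[Tokens], float],
           vocab_size: int = 8, seed: int = 0) -> Tokens:
    """Iterative unmasking with optional best-of-K selection.

    k_iter > 1 alone  -> variant (1): selection at every unmasking step.
    k_final > 1 alone -> variant (2): selection over complete decodes.
    both > 1          -> variant (3): the hybrid.
    """
    rng = random.Random(seed)
    per_step = -(-length // steps)  # ceiling division

    def run_once() -> Tokens:
        tokens = [MASK] * length
        for _ in range(steps):
            tokens = best_of_k(
                lambda: unmask_step(tokens, per_step, vocab_size, rng),
                rate, k_iter)
        return tokens

    return best_of_k(run_once, rate, k_final)


def toy_rate(tokens: Tokens) -> float:
    """Hypothetical rating predictor: prefers larger token ids."""
    return float(sum(t for t in tokens if t != MASK))


hybrid = decode(length=12, steps=4, k_iter=3, k_final=3, rate=toy_rate)
```

In a real system, `rate` would be a learned perceptual rating predictor applied to the synthesized speech, and `unmask_step` would be the TTS model's own sampling step. The cost grows with K, since each candidate requires a rating, and the hybrid variant multiplies the two K values.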
Kazuki Yamauchi
Yuki Saito
Hiroshi Saruwatari
The Journal of the Acoustical Society of America
The University of Tokyo
Bunkyo University
Yamauchi et al. studied this question.
www.synapsesocial.com/papers/6a056668a550a87e60a1e6f6 — DOI: https://doi.org/10.1121/10.0041552