We present a method for extracting voice style embeddings from arbitrary speech samples for a text-to-speech (TTS) system whose style encoder is not publicly available. By enabling gradient backpropagation through the frozen TTS pipeline, we optimize only the style conditioning vectors, keeping all model weights frozen, using a perceptual loss derived from WavLM hidden representations. Guided by recent probing analysis showing that early layers of self-supervised speech models best encode speaker-related attributes, we use a single WavLM layer (layer 3) to compute time-averaged feature statistics as our optimization objective. Experiments on two structurally different TTS models, SupertonicTTS (flow matching, 65.5M parameters) and Kokoro (StyleTTS 2-based, 81.8M parameters), with 44 speakers per model demonstrate consistent improvements over preset baselines, verified by cross-architecture evaluation with three independent speaker verification models (WavLM-SV, ECAPA-TDNN, ResNet). On SupertonicTTS, our method achieves 79% of the same-speaker ECAPA-TDNN ceiling (SIME: 0.452) with 2.70% WER (Kokoro: 0.42% WER).

Preprint. Manuscript prepared for submission to ICASSP 2027.

Code: https://github.com/kdrkdrkdr/supertonic.embed and https://github.com/kdrkdrkdr/kokoro.embed
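The core idea of the abstract, updating only a style vector while the synthesizer and the feature extractor stay fixed, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-dimensional `synthesize` function stands in for the frozen TTS pipeline and the WavLM layer, the names `time_averaged`, `perceptual_loss`, and `optimize_style` are hypothetical, the statistic is a plain time average, and the gradient is estimated by finite differences rather than by the backpropagation the paper describes. It is not the author's implementation.

```python
import math

# Frozen toy "model": these weights stand in for the frozen TTS pipeline and
# feature extractor. They are never updated during optimization.
W = [[0.8, -0.3], [0.2, 0.9]]


def synthesize(style, num_frames=8):
    """Map a 2-D style vector to a sequence of 2-D feature frames."""
    frames = []
    for t in range(num_frames):
        phase = math.sin(0.5 * t)  # content-dependent variation, not style
        frames.append([
            math.tanh(W[0][0] * style[0] + W[0][1] * style[1] + 0.1 * phase),
            math.tanh(W[1][0] * style[0] + W[1][1] * style[1] - 0.1 * phase),
        ])
    return frames


def time_averaged(frames):
    """Average each feature dimension over time."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def perceptual_loss(style, target_stats):
    """Squared distance between time-averaged features and the target."""
    stats = time_averaged(synthesize(style))
    return sum((a - b) ** 2 for a, b in zip(stats, target_stats))


def optimize_style(target_stats, init, lr=0.5, steps=500, eps=1e-5):
    """Gradient descent on the style vector only.

    Assumption: central finite differences replace backpropagation here so
    that the sketch needs nothing beyond the standard library.
    """
    style = list(init)
    for _ in range(steps):
        grad = []
        for i in range(len(style)):
            up, down = style[:], style[:]
            up[i] += eps
            down[i] -= eps
            diff = perceptual_loss(up, target_stats) - perceptual_loss(down, target_stats)
            grad.append(diff / (2 * eps))
        style = [s - lr * g for s, g in zip(style, grad)]
    return style


# The "reference speaker" is simulated by a hidden style vector; only its
# time-averaged feature statistics are visible to the optimizer.
reference_style = [0.6, -0.4]
target_stats = time_averaged(synthesize(reference_style))

preset = [0.0, 0.0]  # stands in for a preset baseline voice
fitted = optimize_style(target_stats, preset)
loss_before = perceptual_loss(preset, target_stats)
loss_after = perceptual_loss(fitted, target_stats)
```

In this toy setting the loss after optimization falls well below the loss of the preset starting point, which mirrors the comparison against preset baselines reported above. A real system would presumably compute the statistics from WavLM layer-3 activations of the synthesized and reference audio.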
Gyeongmin Kim
Hanyang University
www.synapsesocial.com/papers/6a080af2a487c87a6a40cfc8 — DOI: https://doi.org/10.5281/zenodo.20023257