May 16, 2026 | Open Access

Extracting Voice Styles from Frozen TTS Models via Gradient-Based Inverse Optimization

Key Points

Aims to extract voice style embeddings for TTS models whose style encoders are not accessible.
Optimizes only the style conditioning vectors via gradients, keeping all model weights frozen.
Uses a perceptual loss derived from WavLM hidden representations as the optimization objective.
Evaluated on two architecturally different models, SupertonicTTS and Kokoro, with 44 speakers each.
Achieves 79% of the same-speaker ECAPA-TDNN ceiling on SupertonicTTS (SIM_E: 0.452, WER: 2.70%).
Achieves 0.42% WER on Kokoro.
Cross-architecture evaluation confirms consistent improvements over preset baselines.
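
To make the "79% of ceiling" figure concrete, here is a minimal sketch of how such a score could be computed. It assumes that SIM_E is a cosine similarity between ECAPA-TDNN speaker embeddings, which is common practice in speaker verification but is not defined in this summary. The function names are illustrative, and the ceiling value is only what the two reported numbers imply, not a figure taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fraction_of_ceiling(sim, ceiling):
    """Express a similarity score as a fraction of a same-speaker ceiling."""
    return sim / ceiling

def implied_ceiling(sim, fraction):
    """Back out the ceiling implied by a score and its reported fraction."""
    return sim / fraction

# Reported figures: SIM_E = 0.452 at 79% of the ceiling.
# The ceiling is not stated in the key points, so this is an inferred value.
approx_ceiling = implied_ceiling(0.452, 0.79)
```

Under these assumptions the same-speaker ceiling would sit near 0.57, meaning that two real recordings of the same speaker would score roughly that similarity under the same verifier.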

Abstract

We present a method to extract voice style embeddings from arbitrary speech samples for a text-to-speech (TTS) system whose style encoder is not publicly available. By enabling gradient backpropagation through the frozen TTS pipeline, we optimize only the style conditioning vectors (all model weights remain frozen) using a perceptual loss derived from WavLM hidden representations. Guided by recent probing analysis showing that early layers of self-supervised speech models best encode speaker-related attributes, we use a single WavLM layer (layer 3) to compute time-averaged feature statistics as our optimization objective. Experiments on two structurally different TTS models, SupertonicTTS (flow matching, 65.5M params) and Kokoro (StyleTTS 2-based, 81.8M params), with 44 speakers per model, demonstrate consistent improvements over preset baselines, verified by cross-architecture evaluation with three independent speaker verification models (WavLM-SV, ECAPA-TDNN, ResNet). On SupertonicTTS, our method achieves 79% of the same-speaker ECAPA-TDNN ceiling (SIM_E: 0.452) with 2.70% WER (Kokoro: 0.42%).

Preprint. Manuscript prepared for submission to ICASSP 2027.

Code: https://github.com/kdrkdrkdr/supertonic.embed, https://github.com/kdrkdrkdr/kokoro.embed
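
The core idea of the abstract, optimizing only a style vector against time-averaged feature statistics while the model stays frozen, can be illustrated with a toy sketch. Everything below is a hypothetical stand-in: the matrix `W` plays the role of the frozen TTS pipeline, `A` plays the role of a frozen feature layer (WavLM layer 3 in the paper), and `FRAMES` stands in for text-dependent content. Because the toy is linear, the gradient is written out analytically, whereas the paper backpropagates through the real networks. This is not the authors' implementation.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Frozen toy "TTS decoder" (style -> frame) and frozen toy "feature layer".
# Both are hypothetical stand-ins and are never updated below.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
A = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.2, 0.2, 1.0]]
# Per-frame content offsets, standing in for the text being spoken.
FRAMES = [[0.1, -0.2, 0.0], [0.3, 0.1, -0.1], [-0.2, 0.0, 0.2]]

def time_averaged_features(style):
    """Synthesize each frame from the style, featurize it, average over time."""
    base = matvec(W, style)
    total = [0.0] * len(A)
    for offset in FRAMES:
        frame = [b + o for b, o in zip(base, offset)]
        total = [t + f for t, f in zip(total, matvec(A, frame))]
    return [t / len(FRAMES) for t in total]

def extract_style(target_stats, steps=300, lr=0.1):
    """Gradient descent on the style vector only; W and A stay fixed."""
    style = [0.0, 0.0]
    # Jacobian of the time-averaged features w.r.t. style is A @ W here.
    jac_t = [list(col) for col in zip(*matmul(A, W))]
    for _ in range(steps):
        stats = time_averaged_features(style)
        residual = [s - t for s, t in zip(stats, target_stats)]
        # Gradient of the squared-error loss: 2 * (A W)^T @ residual.
        grad = [2.0 * g for g in matvec(jac_t, residual)]
        style = [s - lr * g for s, g in zip(style, grad)]
    return style

true_style = [0.7, -0.4]                       # unknown to the optimizer
target = time_averaged_features(true_style)    # "reference speech" statistics
recovered = extract_style(target)
```

In this toy setting the target statistics come from the same frozen pipeline, so the style vector is recovered almost exactly. With real speech as the target, a perfect match is not expected, which is presumably why the paper reports results relative to a same-speaker ceiling.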
