Self-supervised learning has revolutionized every area of speech processing, and voice conversion is no exception. Using self-supervised speech representations such as HuBERT and WavLM has become the de facto standard in voice conversion. However, this success is often demonstrated under constrained conditions, typically involving read speech from English speakers. In this talk, I will first provide an overview of a typical voice conversion framework based on self-supervised speech representations. I will then highlight key challenges that remain to be addressed, including speaker information leakage, temporal structure modeling, and the application to more complex tasks such as accent conversion and singing voice conversion. These ongoing efforts aim to push the boundaries of what is possible with voice conversion in diverse, real-world scenarios.
Wen-Chin Huang (Wed,) studied this question.