Voice disentanglement, the process of separating speech or singing voice into several latent subspaces, each representing a particular aspect of the signal, is important in diverse audio processing applications. In this paper, we propose an efficient weakly-supervised approach to this problem. Unlike most existing weakly-supervised methods, which handle fixed-length sequences and single-rate representations, our approach employs transformers and variational autoencoders to support variable-length sequences and multi-rate representations. Furthermore, we integrate a swapping technique for paired weak supervision, show that it can lead to optimal disentanglement, and demonstrate its efficacy in our model. Experimental evaluation on VocalSet for singing voice disentanglement shows that our approach finds more disentangled singing voice representations. Similarly, tests on LibriSpeech for speech recognition highlight our method's effectiveness in removing speaker information from speech content.
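To make the swapping idea concrete, the following is a minimal sketch of what swapping a shared latent subspace between a weakly-paired pair of samples could look like. It is an illustration under stated assumptions, not the model described above: the function name `swap_shared_latents`, the flat latent layout, and the choice of which dimensions are "shared" are all hypothetical, and the encoder and decoder are omitted entirely.

```python
import numpy as np

def swap_shared_latents(z_a, z_b, shared):
    """Exchange the `shared` slice of two latent vectors; other dims are untouched.

    Hypothetical helper: assumes both latents use the same layout along the
    last axis and that `shared` indexes the factor the pair has in common.
    """
    z_a_swapped = z_a.copy()
    z_b_swapped = z_b.copy()
    z_a_swapped[..., shared] = z_b[..., shared]
    z_b_swapped[..., shared] = z_a[..., shared]
    return z_a_swapped, z_b_swapped

# Assumed layout: the first 2 dims hold the factor shared by the pair
# (for example, singer identity); the remaining 3 dims hold everything else.
shared = slice(0, 2)
z_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z_b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z_a_sw, z_b_sw = swap_shared_latents(z_a, z_b, shared)
```

In a training loop, the swapped latents would typically be decoded and compared against the original inputs, so that the reconstruction only succeeds if the shared factor really lives in the swapped subspace. That loss term is an assumption about how such a scheme is commonly set up and is not shown here.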
Izadi et al. studied this question.