We propose using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), the recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline. Audio samples can be found under the following link: .github.io/resynthesis.
Polyak et al. studied this question.
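To make the bitrate figure in the abstract more concrete, the sketch below shows one common way the rate of a discrete-code representation can be estimated: each stream contributes its code rate multiplied by the bits needed per code. This is an illustration under stated assumptions, not the authors' accounting. The function names `stream_bitrate` and `total_bitrate`, the fixed-length coding assumption (no entropy coding), and the frame rates and codebook sizes are all hypothetical and are not taken from the paper.

```python
import math


def stream_bitrate(codes_per_second: float, codebook_size: int) -> float:
    """Bits per second for one stream of discrete codes.

    Assumes fixed-length coding: every code costs log2(codebook_size) bits.
    """
    if codes_per_second < 0:
        raise ValueError("codes_per_second must be non-negative")
    if codebook_size < 1:
        raise ValueError("codebook_size must be at least 1")
    return codes_per_second * math.log2(codebook_size)


def total_bitrate(streams) -> float:
    """Sum the bitrates of several (codes_per_second, codebook_size) streams."""
    return sum(stream_bitrate(rate, size) for rate, size in streams)


# Hypothetical configuration, chosen only to illustrate the arithmetic.
content_stream = (50.0, 128)  # 50 codes/s, 7 bits each
prosody_stream = (12.5, 32)   # 12.5 codes/s, 5 bits each

if __name__ == "__main__":
    rate = total_bitrate([content_stream, prosody_stream])
    print(f"estimated rate: {rate:.1f} bits per second")
```

Under these assumed values, the content stream dominates the total, which suggests why the choice of codebook size and frame rate for the content representation matters most for a low-bitrate codec. A time-invariant speaker representation, if sent once per utterance, would add a cost that shrinks as the utterance gets longer.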