Key points are not available for this paper at this time.
End-to-end speech coding models achieve high coding gains by learning compact yet expressive features and a powerful decoder in a single network. A challenging problem as such results in unwelcome complexity increase and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learning low-dimensional discrete tokens. Instead of using its decoder, we employ a latent diffusion model to de-quantize coded features into a high-dimensional continuous space, relieving the decoder's burden of de-quantizing and upsampling. To mitigate the issue of over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. We investigate the hyperparameters for midway-infilling and latent diffusion space with different dimensions in ablation studies. Subjective listening tests show that our model outperforms the state-of-the-art at two low bitrates, 1.5 and 3 kbps. We open-source the project for reproducibility 1 .
Yang et al. (Mon,) studied this question.