This study proposes CodecMOS, a prediction model that automatically estimates the naturalness Mean Opinion Score (MOS) of singing voices by integrating Self-Supervised Learning (SSL)-based speech representations with latent representations from a Neural Audio Codec (NAC). Conventional MOS prediction models built on SSL representations are highly accurate for read speech, but their performance degrades significantly for singing voices because of pitch fluctuations and greater acoustic diversity. CodecMOS aims to complement the semantic information captured by SSL with the acoustic information encoded in the NAC latent representations. Experiments comparing several fusion strategies show that the model fusing wav2vec 2.0 and Descript Audio Codec features via Feature-wise Linear Modulation (FiLM) achieves the highest correlation coefficients, outperforming existing methods. This work is the first to systematically investigate the complementary relationship between SSL and NAC representations in predicting the naturalness of singing voices.
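The FiLM fusion named in the abstract can be illustrated with a minimal sketch. FiLM itself is a standard operation: a conditioning vector is mapped to a per-dimension scale (gamma) and shift (beta), which are applied to another feature vector. Everything else here is an assumption made for illustration and is not taken from the paper: that the codec latent conditions the SSL feature (rather than the reverse), the function and parameter names (`linear`, `film`, `w_gamma`, `b_gamma`, `w_beta`, `b_beta`), the toy dimensions, and the hand-picked weights. In a real model the weights would be learned and the frames would come from wav2vec 2.0 and Descript Audio Codec.

```python
def linear(weights, bias, vec):
    """Affine map; `weights` is a list of rows, one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(ssl_frame, codec_frame, params):
    """Apply FiLM to one SSL frame, conditioned on one codec frame.

    ASSUMPTION: the codec latent is the conditioning input and the SSL
    feature is the modulated input; the paper may arrange this differently.
    """
    gamma = linear(params["w_gamma"], params["b_gamma"], codec_frame)  # scale
    beta = linear(params["w_beta"], params["b_beta"], codec_frame)     # shift
    return [g * x + b for g, x, b in zip(gamma, ssl_frame, beta)]


# Toy, hand-picked parameters (hypothetical): 2-dim SSL frame, 2-dim codec frame.
toy_params = {
    "w_gamma": [[1.0, 0.0], [0.0, 1.0]],  # gamma copies the codec frame
    "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]],   # beta ignores the codec frame
    "b_beta": [1.0, 1.0],                 # constant shift of 1
}

fused = film([2.0, 3.0], [0.5, 2.0], toy_params)
print(fused)
```

With zero weights, a gamma bias of 1 and a beta bias of 0, the layer reduces to the identity on the SSL frame, which is a common way to initialise such a layer so that fusion starts out as a no-op.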
Ryoko Arita
Joonyong Park
Wataru Nakata
Nippon Onkyo Gakkaishi/Acoustical science and technology/Nihon Onkyo Gakkaishi
The University of Tokyo
National Institute of Advanced Industrial Science and Technology
Arita et al. studied this question.
www.synapsesocial.com/papers/69e5c27e03c2939914028b68 — DOI: https://doi.org/10.1250/ast.e25.101