What question did this study set out to answer?

The study aims to develop a model that predicts the naturalness of singing voices by combining self-supervised speech representations with neural audio codec features.

April 20, 2026 | Open Access

CodecMOS: Singing MOS Prediction through the Integration of Self-Supervised Speech Representations and Neural Audio Codec Features

Key Points

Proposes CodecMOS, a MOS prediction model that integrates Self-Supervised Learning (SSL) speech representations with latent features from a Neural Audio Codec (NAC).
Compares several fusion strategies for combining the two types of representation.
Uses wav2vec 2.0 (SSL) and Descript Audio Codec (NAC) features as model input.
Fuses the two feature types with Feature-wise Linear Modulation (FiLM) in the best-performing configuration.
The FiLM-based model achieves the highest correlation coefficients among the compared fusion strategies and outperforms existing MOS prediction methods.
Improves prediction accuracy for singing voices over conventional methods.
Shows that combining SSL and NAC representations can effectively address the challenges posed by pitch fluctuations.
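
The summary does not say which correlation coefficients are reported. As a hedged illustration only, the sketch below computes two measures commonly used to score MOS predictors, the Pearson linear correlation and the Spearman rank correlation, between listener scores and model predictions. The function names and the sample scores are hypothetical, and the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation between two 1-D score arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(x):
    """Zero-based ranks of x (assumption: no tied values)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up example values, not results from the paper.
true_mos = [1.5, 2.0, 3.5, 4.0, 4.5]
pred_mos = [2.0, 2.2, 3.0, 4.1, 3.9]
print("Pearson: ", pearson(true_mos, pred_mos))
print("Spearman:", spearman(true_mos, pred_mos))
```

A higher value on either measure means the predicted scores track listener ratings more closely; the rank-based measure ignores the scale of the predictions and looks only at their ordering.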

Abstract

This study proposes CodecMOS, a prediction model that automatically estimates the naturalness Mean Opinion Score (MOS) of singing voices by integrating Self-Supervised Learning (SSL)-based speech representations with latent representations from a Neural Audio Codec (NAC). Although conventional MOS prediction models utilizing SSL representations are highly accurate for read speech, their performance significantly degrades for singing voices due to pitch fluctuations and greater acoustic diversity. CodecMOS aims to complement the semantic information captured by SSL with the acoustic information encoded in the latent representations of NAC. Experimental evaluations comparing several fusion strategies demonstrate that the model fusing wav2vec 2.0 and Descript Audio Codec features via Feature-wise Linear Modulation (FiLM) achieves the highest correlation coefficients, outperforming existing methods. This work is the first to systematically investigate the complementary relationship between SSL and NAC representations in predicting the naturalness of singing voices.
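
To make the FiLM fusion concrete, here is a minimal NumPy sketch of the general technique: one feature stream produces a per-channel scale (gamma) and shift (beta) that modulate the other stream. Several choices here are assumptions for illustration and are not taken from the paper: that the NAC latents condition the SSL features (rather than the reverse), that both streams are already aligned to the same number of frames, that gamma and beta come from single linear layers, and all names and dimensions, which are placeholders.

```python
import numpy as np

def film_fuse(ssl_feats, nac_feats, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation of SSL features by NAC features.

    ssl_feats: (T, D_ssl) frame-level SSL representations
    nac_feats: (T, D_nac) frame-level codec latents (assumed time-aligned)
    Returns an array of shape (T, D_ssl).
    """
    gamma = nac_feats @ w_gamma + b_gamma  # per-frame, per-channel scale
    beta = nac_feats @ w_beta + b_beta     # per-frame, per-channel shift
    return gamma * ssl_feats + beta

# Placeholder shapes and random values stand in for real features and
# learned weights; none of these numbers come from the paper.
rng = np.random.default_rng(0)
T, D_SSL, D_NAC = 50, 768, 1024
ssl_feats = rng.standard_normal((T, D_SSL))
nac_feats = rng.standard_normal((T, D_NAC))
w_gamma = rng.standard_normal((D_NAC, D_SSL)) * 0.01
b_gamma = np.ones(D_SSL)   # scale starts near 1, so fusion starts near identity
w_beta = rng.standard_normal((D_NAC, D_SSL)) * 0.01
b_beta = np.zeros(D_SSL)

fused = film_fuse(ssl_feats, nac_feats, w_gamma, b_gamma, w_beta, b_beta)
utterance_embedding = fused.mean(axis=0)  # simple mean pooling over time
print(fused.shape, utterance_embedding.shape)
```

In a full model the pooled embedding would typically feed a small regression head that outputs the MOS estimate, and the projection weights would be learned jointly with it; both of those steps are outside this sketch.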
