In this study, we proposed a novel automated speech quality estimation model capable of evaluating perceptual dysphonia severity and breathiness in audio samples, ensuring alignment with expert-rated assessments. The proposed model integrates Whisper ASR embeddings with Mel spectrograms augmented by second-order delta features combined with a sequential-attention fusion network feature mapping path. This hybrid approach enhances the model’s sensitivity to phonetic, high level feature representation and spectral variations, enabling more accurate predictions of perceptual speech quality. A sequential-attention fusion network feature mapping module captures long-range de-pendencies through the multi-head attention network, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model’s superior generalization across test samples. Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in speech quality assessment, offering a promising pathway for advancing automated evaluation systems.
Ashkanichenarlogh et al. (Mon,) studied this question.