What type of study is this?

This is a quantitative study.

October 12, 2025 · Open Access

Automated Severity and Breathiness Assessment of Disordered Speech Using a Speech Foundation Model

Key Points

The proposed automated speech quality model predicts perceptual dysphonia severity and breathiness in line with expert ratings.
In a comparison against state-of-the-art dysphonia assessment methods, the model generalizes better across test samples.
Combining Whisper ASR embeddings with a deep feature mapping path improves perceptual speech quality assessment.
Multi-head attention captures long-range dependencies, and LSTM layers then refine the representation by modeling temporal dynamics.
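To illustrate why attention suits long-range dependencies, here is a minimal single-head self-attention sketch in NumPy. It is not the paper's network: the function name `self_attention`, the random projection weights, and the single head are all assumptions made for illustration. The point is only that each output frame is a weighted mix of every input frame, however far apart they are in time.

```python
import numpy as np


def self_attention(x: np.ndarray, rng: np.random.Generator):
    """Single-head scaled dot-product self-attention (illustrative only).

    x has shape (n_frames, d_model). The projection matrices are random
    stand-ins for learned weights; a trained model would learn them.
    """
    d = x.shape[1]
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Every frame is scored against every other frame, so distance in
    # time does not limit which frames can influence each other.
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

    return weights @ v, weights


rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))  # 50 hypothetical frames, 16 features
context, attn = self_attention(frames, rng)
```

Each row of `attn` is a probability distribution over all 50 frames, which is what lets the first frame draw on the last one directly. A multi-head variant runs several such maps in parallel on lower-dimensional projections and concatenates the results.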

Abstract

In this study, we propose a novel automated speech quality estimation model that evaluates perceptual dysphonia severity and breathiness in audio samples, in alignment with expert-rated assessments. The proposed model integrates Whisper ASR embeddings with Mel spectrograms augmented by second-order delta features, combined with a sequential-attention fusion network feature mapping path. This hybrid approach enhances the model's sensitivity to phonetic content, high-level feature representations, and spectral variations, enabling more accurate predictions of perceptual speech quality. The sequential-attention fusion network feature mapping module captures long-range dependencies through a multi-head attention network, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model's superior generalization across test samples. Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in speech quality assessment, offering a promising pathway for advancing automated evaluation systems.
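As a rough illustration of the "Mel spectrograms augmented by second-order delta features" mentioned in the abstract, the sketch below computes delta and delta-delta features with the standard regression formula and stacks them with the input as three channels. The abstract does not give the paper's exact configuration, so the helper names `delta` and `stack_with_deltas`, the window half-width of 2, the edge padding, and the three-channel layout are all assumptions. The input is assumed to be a precomputed (log-)Mel spectrogram of shape `(n_mels, n_frames)`.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis.

    features has shape (n_mels, n_frames). For each frame t:
        d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n} n^2)
    Edges are handled by repeating the first and last frames.
    """
    n_frames = features.shape[1]
    padded = np.pad(features, ((0, 0), (width, width)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))

    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[:, width + n : width + n + n_frames]
        behind = padded[:, width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_with_deltas(mel: np.ndarray) -> np.ndarray:
    """Return an array of shape (3, n_mels, n_frames): static, delta, delta-delta."""
    d1 = delta(mel)   # first-order: rate of spectral change
    d2 = delta(d1)    # second-order: change of that rate
    return np.stack([mel, d1, d2], axis=0)


# Hypothetical input: 80 Mel bands, 200 frames of random values.
mel = np.random.default_rng(0).standard_normal((80, 200))
features = stack_with_deltas(mel)
```

Stacking the three arrays as channels is one common way to feed them to a convolutional or attention front end. Concatenating along the frequency axis is an equally plausible layout.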
