October 16, 2025 | Open Access

Improved Dysarthric Speech to Text Conversion via TTS Personalization

Péter Mihajlik (Eötvös Loránd University), Éva Székely (KTH Royal Institute of Technology), Piroska Zsófia Barta (Budapest University of Technology and Economics)

Key Points

Fine-tuning the ASR model on both real and synthetic dysarthric speech reduced the character error rate (CER) from 36-51% (zero-shot) to 7.3%.
Adding synthetic speech to the ASR training data yielded an 18% relative CER reduction compared with using only real data.
The method generates synthetic dysarthric speech with controlled severity from the speaker's premorbidity recordings via speaker embedding interpolation.
Personalized ASR systems show potential in enhancing accessibility for individuals with severe speech impairments.
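
For readers unfamiliar with the metric, the sketch below shows how a character error rate and a relative reduction are commonly computed. It is a minimal illustration, not the authors' evaluation code: the function names `edit_distance`, `cer`, and `relative_reduction` are hypothetical, and the paper may normalize text (casing, punctuation) in ways not shown here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = character edits / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up transcripts and error rates, for illustration only.
    print(cer("szia", "szio"))
    print(relative_reduction(0.10, 0.082))
```

A relative reduction is measured against the baseline error rate, so an 18% relative reduction means the error rate fell by 18% of its previous value, not by 18 percentage points.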

Abstract

We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformerHu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
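
The idea of speaker embedding interpolation can be sketched as follows. The abstract does not state the interpolation formula, so this example assumes plain linear interpolation between two fixed-length embedding vectors; the names `interpolate_embeddings` and `severity_continuum`, and the toy three-dimensional vectors, are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def interpolate_embeddings(
    premorbid: Sequence[float],
    dysarthric: Sequence[float],
    alpha: float,
) -> List[float]:
    """Linearly blend two speaker embeddings.

    alpha = 0.0 returns the premorbid embedding, alpha = 1.0 the dysarthric one.
    Linear blending is an assumption; the paper may use a different scheme.
    """
    if len(premorbid) != len(dysarthric):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * d for p, d in zip(premorbid, dysarthric)]


def severity_continuum(
    premorbid: Sequence[float],
    dysarthric: Sequence[float],
    steps: int,
) -> List[List[float]]:
    """Return `steps` evenly spaced embeddings from premorbid to dysarthric."""
    if steps < 2:
        raise ValueError("need at least two steps to span both endpoints")
    return [
        interpolate_embeddings(premorbid, dysarthric, i / (steps - 1))
        for i in range(steps)
    ]


if __name__ == "__main__":
    # Toy vectors; real speaker embeddings have far more dimensions.
    healthy = [0.0, 1.0, 2.0]
    impaired = [2.0, 1.0, 0.0]
    for emb in severity_continuum(healthy, impaired, steps=5):
        print(emb)
```

Under this assumption, each intermediate embedding would condition the TTS system to synthesize speech at a severity level between the premorbid and the dysarthric voice, giving the ASR model a range of impairments to fine-tune on.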
