What question did this study set out to answer?

The aim is to develop automated, deep learning-based tools for analyzing pathological speech, improving clinical assessment and therapy.

February 2, 2026 | Open Access

Deep Learning Approaches for Clinical Speech Analysis: From Latent Representations to Articulatory Estimation

Key Points

Developed a method for automatic intelligibility assessment using latent speech representations.
Investigated pre-training of self-supervised learning models on healthy speech versus pathological speech.
Created a multi-task learning architecture for estimating articulatory trajectories and phoneme sequences from raw audio.
Proposed a comprehensive framework for speech articulation analysis, integrating several analytic tasks.
Achieved high performance in predicting intelligibility scores with reduced reliance on reference signals.
Found that pre-training exclusively on healthy speech yields better downstream pathology detection, suggesting that atypical data can introduce harmful inductive biases.
Successfully predicted articulatory movements and phonetic information without relying on transcriptions.
The integrated framework shows promise for therapy-oriented applications in communication disorders.

Abstract

The analysis of pathological speech, particularly in individuals recovering from conditions such as stroke, presents significant challenges for clinical assessment and therapy. Traditional methods are often subjective and labor-intensive, highlighting the need for objective, automated tools. This thesis addresses this need by developing and evaluating novel deep learning approaches that bridge acoustic signals with their underlying articulatory and linguistic representations, aiming to provide robust and clinically relevant metrics for speech analysis. The contributions of this work span several key areas.

First, a novel method for automatic intelligibility assessment of pathological speech is introduced. Leveraging disentangled latent speech representations derived from a voice conversion-inspired architecture, this approach significantly reduces the reliance on multiple reference signals and mitigates speaker variability, achieving high performance in predicting intelligibility scores from limited data.

Next, the influence of pathological speech during the pre-training of self-supervised learning models is investigated using a prominent architecture for learning representations from raw audio, evaluated on downstream pathology detection. Results show that exclusive pre-training on healthy speech yields superior performance, suggesting that atypical data can introduce harmful inductive biases, a critical consideration when adapting foundation models to clinical applications.

The third part presents a novel multi-task learning architecture for the speaker- and text-independent joint estimation of continuous articulatory trajectories (derived from electromagnetic articulography) and discrete phoneme sequences, including their temporal alignments, directly from raw audio. This work defines a new task of jointly inverting speech into its articulatory and phonetic components and proposes model variants that robustly predict articulatory motion and phonetic information without requiring textual transcriptions during inference.

Building on this model, the final contribution extends it into an integrated, multi-level end-to-end framework for comprehensive speech articulation and spoken language analysis. This framework enables phoneme error rate-based intelligibility assessment, articulatory category analysis, comparison of articulatory trajectories with synthesized references, and open-vocabulary keyword spotting. It demonstrates potential for therapy-oriented applications by transforming raw speech into meaningful linguistic and articulatory insights.

Together, these contributions advance clinical speech analysis by providing robust, data-driven tools that offer deeper insights into speech production mechanisms in both healthy and pathological speech. The developed methods lay the groundwork for more objective, scalable, and personalized therapeutic interventions, ultimately aiming to improve the quality of care for individuals with communication disorders.
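The phoneme error rate named above as a basis for intelligibility assessment is conventionally the Levenshtein (edit) distance between a reference and a predicted phoneme sequence, divided by the reference length. The following is a minimal sketch of that standard definition only; the function names and the example phoneme symbols are illustrative assumptions, and the abstract does not state how the thesis computes or normalizes its scores.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length (hypothetical helper)."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative symbols: one substitution over three reference phonemes.
    reference = ["k", "ae", "t"]
    predicted = ["k", "ae", "p"]
    print(f"PER = {phoneme_error_rate(reference, predicted):.3f}")
```

Because insertions are counted, this rate can exceed 1.0 when the predicted sequence is much longer than the reference, so any mapping from it to a clinical intelligibility score would need to account for that.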
