The analysis of pathological speech, particularly in individuals recovering from conditions such as stroke, presents significant challenges for clinical assessment and therapy. Traditional methods are often subjective and labor-intensive, highlighting the need for objective, automated tools. This thesis addresses this need by developing and evaluating novel deep learning approaches that bridge acoustic signals with their underlying articulatory and linguistic representations, aiming to provide robust and clinically relevant metrics for speech analysis. The contributions of this work span several key areas. First, a novel method for automatic intelligibility assessment of pathological speech is introduced. Leveraging disentangled latent speech representations derived from a voice conversion-inspired architecture, this approach significantly reduces the reliance on multiple reference signals and mitigates speaker variability, achieving high performance in predicting intelligibility scores from limited data. Second, the effect of including pathological speech in the pre-training of self-supervised learning models is investigated, using a prominent architecture for learning representations from raw audio and evaluating it on downstream pathology detection. Results show that exclusive pre-training on healthy speech yields superior performance, suggesting that atypical data can introduce harmful inductive biases, a critical consideration when adapting foundation models to clinical applications. The third part presents a novel multi-task learning architecture for the speaker- and text-independent joint estimation of continuous articulatory trajectories (derived from electromagnetic articulography) and discrete phoneme sequences, including their temporal alignments, directly from raw audio.
This work defines a new task of jointly inverting speech into its articulatory and phonetic components and proposes model variants that robustly predict articulatory motion and phonetic information without requiring textual transcriptions during inference. Building on this model, the final contribution extends it into an integrated, multi-level end-to-end framework for comprehensive speech articulation and spoken language analysis. This framework enables phoneme error rate-based intelligibility assessment, articulatory category analysis, comparison of articulatory trajectories with synthesized references, and open-vocabulary keyword spotting. It demonstrates potential for therapy-oriented applications by transforming raw speech into meaningful linguistic and articulatory insights. Together, these contributions advance clinical speech analysis by providing robust, data-driven tools that offer deeper insights into speech production mechanisms in both healthy and pathological speech. The developed methods lay the groundwork for more objective, scalable, and personalized therapeutic interventions, ultimately aiming to improve the quality of care for individuals with communication disorders.
Tobias Weise (Thu,) studied this question.