What question did this study set out to answer?

The aim is to develop a framework that synthesizes vocal tract motion visuals from speech signals for clinical assessments.

March 28, 2026 | Open Access

A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI

Key Points

Developed an audio-to-video generation framework using diffusion models.
Preprocessed real-time/cine-MRI sequences and speech signals to achieve temporal alignment.
Integrated spatial and temporal diffusion blocks for enhanced data synchronization.
Evaluated on healthy controls and tongue cancer patients, comparing vocal tract movements.
Successfully generated MRI sequences from new speech inputs.
Demonstrated effective generalization and adaptability in synthesizing vocal tract visuals.
Human evaluations confirmed high realism and accuracy in generated visualizations.

Abstract

• This study introduces an audio-to-video generation framework that synthesizes vocal tract dynamics from real-time and cine-MRI using diffusion models.
• It processes speech data from diverse populations, including tongue cancer patients, and is validated by quantitative and qualitative evaluations.
• It is one of the first studies to integrate spatial and temporal diffusion blocks in MRI, enhancing audio-visual data alignment for clinical analysis.

Understanding the relationship between vocal tract motion during speech and the resulting acoustic signal is crucial for aiding clinical assessment and for developing personalized treatment and rehabilitation strategies. Toward this goal, we introduce an audio-to-video generation framework that creates real-time/cine magnetic resonance imaging (RT-/cine-MRI) visuals of the vocal tract from speech signals. Our framework first preprocesses RT-/cine-MRI sequences and speech samples to temporally align them, so that the visual and audio data are synchronized. We then employ a modified stable diffusion model, integrating structural and temporal blocks, to capture movement characteristics and temporal dynamics in the synchronized data. This process enables the generation of MRI sequences from new speech inputs. We evaluated our framework on healthy controls and tongue cancer patients by analyzing and comparing the vocal tract movements in the synthesized videos. The framework adapted to new speech inputs and generalized effectively. In addition, human evaluators rated the generated visualizations as realistic and accurate, suggesting potential for outpatient therapy and personalized simulation of vocal tract visualizations.
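
The temporal alignment step can be illustrated with a minimal sketch. The abstract does not describe how the authors actually align audio and MRI frames, so everything here is an assumption for illustration: the function name `frame_audio_windows`, the frame rate, and the sample rate are all hypothetical. The idea shown is simply that each MRI frame is paired with the span of audio samples recorded during that frame's time interval.

```python
def frame_audio_windows(num_frames, fps, sample_rate, num_samples):
    """Pair each video frame with the audio samples recorded during it.

    Hypothetical helper, not the paper's method. Returns a list of
    (start, end) sample indices, one per frame, with `end` exclusive.
    """
    if num_frames <= 0 or fps <= 0 or sample_rate <= 0 or num_samples < 0:
        raise ValueError("num_frames, fps, sample_rate must be positive")

    samples_per_frame = sample_rate / fps  # may be fractional
    windows = []
    for i in range(num_frames):
        # Round the frame boundaries to the nearest audio sample.
        start = round(i * samples_per_frame)
        end = round((i + 1) * samples_per_frame)
        # Clip to the available audio so trailing frames never overrun it.
        start = min(start, num_samples)
        end = min(end, num_samples)
        windows.append((start, end))
    return windows


if __name__ == "__main__":
    # Assumed values: 3 frames at 10 frames/s, 16 kHz audio, 4000 samples.
    for frame, (start, end) in enumerate(frame_audio_windows(3, 10.0, 16000, 4000)):
        print(f"frame {frame}: samples [{start}, {end})")
```

With these assumed values, the last frame receives a shorter window because the audio ends before the frame does; a real pipeline would need to decide whether to pad, trim, or drop such frames.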
