March 3, 2026 · Open Access

Interpretable vocal tract and respiratory inversion via physics-informed neural operators

Key Points

Aims to reconstruct vocal tract geometry and respiratory dynamics from acoustic observations while keeping the model interpretable and accurate.
Proposes a physics-informed multimodal inversion framework using Kolmogorov-Arnold neural operators.
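Temporally aligns synchronously acquired acoustic and physiological signals to form paired inputs.
Uses a nested three-layer KAN with learnable B-spline bases to invert the audio spectrum into 19-node vocal-tract cross-sectional areas, and a gated recursive module to constrain latent pressure evolution through embedded mass-momentum conservation.
Jointly optimizes a super-resolution prediction head with fractional-order temporal regularization and wave-equation residuals to enhance timbre fidelity.
Achieves, on a dataset of 1000 subjects, a log-spectral distortion of 1.83 ± 0.32 dB and a sub-band error detection rate of 6.4 ± 1.1% in the 1.2-2.4 kHz range.
Demonstrates low end-to-end latency of 14.2-18.3 ms and compact memory usage of 108-121 MB on edge devices.
Maintains the smallest vocal-tract geometry error (MAE-CSA ≤ 0.23) and respiratory estimation bias (RMSE-P ≤ 0.52) across unseen voice types.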
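
The log-spectral distortion figure reported above can be made concrete with a short sketch. This page does not give the paper's exact formula, so the code below assumes one common definition: the per-frame root-mean-square difference between reference and estimated power spectra in dB, averaged over frames. The function name, array layout, and `eps` floor are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_spectral_distortion(ref_power, est_power, eps=1e-10):
    """Mean per-frame log-spectral distortion in dB (assumed definition).

    ref_power, est_power: arrays of shape (frames, bins) holding power spectra.
    eps: small floor to avoid log(0); an illustrative choice.
    """
    ref_db = 10.0 * np.log10(np.maximum(ref_power, eps))
    est_db = 10.0 * np.log10(np.maximum(est_power, eps))
    # RMS difference over frequency bins, one value per frame
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=1))
    # Average over frames
    return float(np.mean(per_frame))

if __name__ == "__main__":
    ref = np.full((5, 8), 2.0)
    # A uniform 10 dB offset in every bin gives an LSD of 10 dB
    print(log_spectral_distortion(ref, ref * 10.0))
```

Restricting the bin axis to the bins that fall inside a band of interest (for example 1.2-2.4 kHz) would give a band-limited variant of the same measure.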

Abstract

Individual physiological differences in the vocal tract and respiratory system pose major challenges to accurate timbre modeling and low-latency feedback in vocal analysis. Existing data-driven approaches often lack physical interpretability and robustness across speakers. In this work, we propose a physics-informed multimodal inversion framework based on Kolmogorov-Arnold neural (KAN) operators to achieve interpretable reconstruction of vocal tract geometry and respiratory dynamics from acoustic observations. Synchronously acquired acoustic and physiological signals are temporally aligned to form paired inputs. A nested three-layer KAN with learnable B-spline bases inverts the audio spectrum into 19-node vocal-tract cross-sectional areas, while a gated recursive module constrains latent pressure evolution through embedded mass-momentum conservation. To further enhance timbre fidelity, a super-resolution prediction head is jointly optimized using fractional-order temporal regularization and wave-equation residuals. Experiments on a dataset of 1000 subjects demonstrate that the proposed method achieves a log-spectral distortion of 1.83 ± 0.32 dB and a sub-band error detection rate of 6.4 ± 1.1% in the 1.2-2.4 kHz range. The framework also exhibits low end-to-end latency (14.2-18.3 ms) and compact memory usage (108-121 MB) on edge devices, while maintaining the smallest vocal-tract geometry error (MAE-CSA ≤ 0.23) and respiratory estimation bias (RMSE-P ≤ 0.52) across unseen voice types. These results indicate that integrating neural operators with explicit physical constraints enables accurate, interpretable, and real-time inversion of vocal physiology, providing a principled technical foundation for fine-grained timbre reconstruction and personalized vocal analysis.
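
To illustrate the kind of mapping the abstract describes, the following is a minimal sketch of a stack of three Kolmogorov-Arnold layers with B-spline bases that maps a spectrum vector to 19 positive area values. It is not the authors' model: the 64-bin input size, hidden width, grid size, `tanh` squashing between layers, softplus output, and random (untrained) coefficients are all assumptions made for illustration, and the paper's nesting scheme, gated recursive module, and physics residuals are not reproduced.

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """All B-spline basis functions at points x (Cox-de Boor recursion).

    x: (n,) points; knots: (m,) strictly increasing.
    Returns an array of shape (n, m - degree - 1).
    """
    x = np.asarray(x, dtype=float)[:, None]
    # Degree 0: indicators of the half-open knot intervals
    B = ((x >= knots[:-1]) & (x < knots[1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x - knots[:-(d + 1)]) / (knots[d:-1] - knots[:-(d + 1)])
        right = (knots[d + 1:] - x) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

class KANLayer:
    """One KAN layer: y_j = sum_i phi_ij(x_i), each phi_ij a B-spline."""

    def __init__(self, in_dim, out_dim, grid_size=8, degree=3, seed=0):
        rng = np.random.default_rng(seed)
        h = 2.0 / grid_size
        # Uniform grid on [-1, 1], extended by `degree` knots on each side
        self.knots = np.arange(-degree, grid_size + degree + 1) * h - 1.0
        self.degree = degree
        # Spline coefficients per (input, output) edge; these would be learned
        self.coef = 0.1 * rng.standard_normal(
            (in_dim, out_dim, grid_size + degree))

    def __call__(self, x):
        batch, in_dim = x.shape
        basis = bspline_basis(x.reshape(-1), self.knots, self.degree)
        basis = basis.reshape(batch, in_dim, -1)
        # Sum each edge function over basis index k and input index i
        return np.einsum("bik,iok->bo", basis, self.coef)

def build_stack(dims, seed=0):
    """Build consecutive KAN layers for the given layer widths."""
    return [KANLayer(a, b, seed=seed + i)
            for i, (a, b) in enumerate(zip(dims[:-1], dims[1:]))]

def invert_spectrum(spectrum, layers):
    """Map spectra (batch, bins), scaled to [-1, 1), to positive areas."""
    h = spectrum
    for layer in layers[:-1]:
        h = np.tanh(layer(h))       # assumed squashing back into the grid
    out = layers[-1](h)
    return np.logaddexp(0.0, out)   # softplus keeps areas positive (assumed)

if __name__ == "__main__":
    layers = build_stack([64, 32, 32, 19])   # hypothetical sizes
    spec = np.random.default_rng(1).uniform(-1.0, 1.0, (4, 64))
    print(invert_spectrum(spec, layers).shape)
```

Because every edge function is a univariate spline, each learned `phi_ij` can be plotted directly, which is the usual argument for the interpretability of KAN-style models.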
