Individual physiological differences in the vocal tract and respiratory system pose major challenges to accurate timbre modeling and low-latency feedback in vocal analysis. Existing data-driven approaches often lack physical interpretability and robustness across speakers. In this work, we propose a physics-informed multimodal inversion framework based on Kolmogorov-Arnold neural (KAN) operators to achieve interpretable reconstruction of vocal tract geometry and respiratory dynamics from acoustic observations. Synchronously acquired acoustic and physiological signals are temporally aligned to form paired inputs. A nested three-layer KAN with learnable B-spline bases inverts the audio spectrum into 19-node vocal-tract cross-sectional areas, while a gated recursive module constrains latent pressure evolution through embedded mass-momentum conservation. To further enhance timbre fidelity, a super-resolution prediction head is jointly optimized using fractional-order temporal regularization and wave-equation residuals. Experiments on a dataset of 1000 subjects demonstrate that the proposed method achieves a log-spectral distortion of 1.83 ± 0.32 dB and a sub-band error detection rate of 6.4 ± 1.1% in the 1.2-2.4 kHz range. The framework also exhibits low end-to-end latency (14.2-18.3 ms) and compact memory usage (108-121 MB) on edge devices, while maintaining the smallest vocal-tract geometry error (MAE-CSA ≤ 0.23) and respiratory estimation bias (RMSE-P ≤ 0.52) across unseen voice types. These results indicate that integrating neural operators with explicit physical constraints enables accurate, interpretable, and real-time inversion of vocal physiology, providing a principled technical foundation for fine-grained timbre reconstruction and personalized vocal analysis.
Deng et al. (Sat,) studied this question.