Articulatory physiological data are a core foundation of Mandarin Chinese phonetic research and speech engineering. Existing multimodal articulatory physiological datasets for Mandarin Chinese have several limitations, including incomplete coverage, single-modality acquisition, and a lack of synchronization, which make it difficult for them to meet the requirements of high-precision research. To address this issue, this study constructs a multimodal articulatory physiological dataset of Mandarin Chinese based on ultrasound tongue imaging, filling the gap left by existing datasets in the fusion of multi-dimensional articulatory physiological information. The dataset covers commonly used valid syllables formed by combinations of initials and finals under the four tone conditions, for a total of 1,024 complete pronunciation units. The multimodal data comprise four parts: text corpora, speech audio, lip video, and ultrasound tongue imaging, which together reflect the physiological movement characteristics and acoustic properties of the pronunciation process. In the quality control stage, manual verification is combined with machine screening to eliminate invalid data such as non-standard pronunciation, blurry images, and distorted audio, thereby ensuring the quality of the dataset. The dataset provides data support for basic research on the physiological mechanisms of Mandarin Chinese pronunciation, the rules of tone change, and second language acquisition, and it also has applications in speech synthesis and recognition, diagnosis and rehabilitation of speech disorders, modeling of pronunciation mechanisms, and training of artificial intelligence speech models. In addition, it offers a reference for cross-language comparative studies of articulatory physiology.
Zhang et al. (Mon,) studied this question.