ABSTRACT As internationalization accelerates, the shortcomings of the traditional English classroom in teaching spoken English are becoming more apparent, especially its limited effectiveness and lack of immediate feedback. To address this, a spoken English assisted training model is proposed. The model combines Smooth Mel‐Frequency Cepstral Coefficient (SMFCC), vocal intensity, and fundamental frequency trajectories, and uses an improved dynamic time warping (DTW) algorithm with constrained slope ranges to match pronunciation features. SMFCC improves feature stability by applying threshold smoothing to short‐term amplitude spectra, which suppresses fundamental frequency interference and high‐frequency noise. The improved DTW algorithm reduces computational complexity by restricting the search to a predefined parallelogram region. Experimental results show that the model achieves a word error rate of 0.112 on the VOIP‐EN‐10H dataset, a 56.7% reduction compared to conventional methods. Under 25 dB high‐noise conditions, the signal‐to‐noise ratio remains at 19 dB, with word recognition accuracy exceeding 90% for students and 80% for general populations. Users show a 30% improvement in expression proficiency, which supports the model's advantages in accuracy, noise resistance, and practical applicability. In summary, the proposed model can objectively evaluate students' English pronunciation, making it a convenient aid for English learning.
Shang et al. (Fri,) studied this question.
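The slope-constrained DTW matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the slope bounds or the step pattern, so the sketch assumes an Itakura-style parallelogram with warping slopes limited to the range [1/2, 2] and the standard three-way step pattern. The names `in_parallelogram`, `constrained_dtw`, and `max_slope` are hypothetical, and plain numeric frames stand in for the SMFCC, intensity, and fundamental frequency features.

```python
import math


def _frame(v):
    # Accept either a scalar sample or a feature vector per frame.
    return tuple(v) if isinstance(v, (list, tuple)) else (float(v),)


def in_parallelogram(i, j, n, m, max_slope=2.0):
    """Return True if cell (i, j) lies inside the assumed search region.

    The region is bounded by lines of slope max_slope and 1/max_slope
    drawn from the start cell (0, 0) and from the end cell (n-1, m-1).
    """
    ri, rj = n - 1 - i, m - 1 - j  # remaining steps to the end cell
    return (j <= max_slope * i and i <= max_slope * j
            and rj <= max_slope * ri and ri <= max_slope * rj)


def constrained_dtw(x, y, max_slope=2.0):
    """Accumulated DTW distance restricted to the parallelogram.

    Returns math.inf when no admissible warping path exists, for
    example when one sequence is far longer than the other.
    """
    xs = [_frame(v) for v in x]
    ys = [_frame(v) for v in y]
    n, m = len(xs), len(ys)
    inf = math.inf
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cells outside the region are skipped, so no local
            # distance is computed for them.
            if not in_parallelogram(i - 1, j - 1, n, m, max_slope):
                continue
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            if best < inf:
                cost[i][j] = math.dist(xs[i - 1], ys[j - 1]) + best
    return cost[n][m]


if __name__ == "__main__":
    reference = [0, 1, 2, 3, 4, 5]
    learner = [0, 1, 1, 2, 3, 3, 4, 5]  # same contour, spoken more slowly
    print(constrained_dtw(reference, learner))
```

Under these assumptions, a learner utterance that is a moderately stretched copy of the reference still aligns at zero cost, while a pair whose lengths differ by more than the slope bound allows is rejected with an infinite distance instead of being force-aligned.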