May 22, 2026 · Open Access

Spoken English Assisted Training Model Based on Multifeature Parameters and Dynamic Time Warping Algorithm

Key Points

The aim is to develop a spoken English training model that improves pronunciation and provides immediate feedback.
The proposed model combines Smooth Mel-Frequency Cepstral Coefficient (SMFCC), vocal intensity, and fundamental frequency trajectories.
It uses an improved dynamic time warping (DTW) algorithm with constrained slope ranges to match pronunciation features.
The model was evaluated on the VOIP-EN-10H dataset, including under high-noise conditions.
It achieved a word error rate of 0.112, a 56.7% reduction compared with conventional methods.
Under 25 dB high-noise conditions, the signal-to-noise ratio remained at 19 dB, with word recognition accuracy above 90% for students.
Users showed a 30% improvement in expression proficiency, supporting the model's accuracy and practical applicability.
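
For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below assumes that standard definition; the source does not say how its 0.112 figure was computed, and the function name and example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for each
    substitution, insertion, and deletion (the usual WER convention).
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

If the reported 56.7% is read as a relative reduction, a word error rate of 0.112 would imply a baseline of roughly 0.26; the source does not state the baseline directly.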

Abstract

As internationalization accelerates, the shortcomings of traditional classroom instruction in spoken English are becoming more apparent, especially its limited effectiveness and lack of immediate feedback. To address this, a spoken English assisted training model is proposed. The model combines Smooth Mel‐Frequency Cepstral Coefficient (SMFCC), vocal intensity, and fundamental frequency trajectories, and employs an improved dynamic time warping (DTW) algorithm with constrained slope ranges to match pronunciation features. SMFCC improves feature stability by applying threshold smoothing to short‐term amplitude spectra, suppressing fundamental frequency interference and high‐frequency noise. The improved DTW algorithm reduces computational complexity by restricting the search to predefined parallelogram regions. Experimental results show that the model achieves a word error rate of 0.112 on the VOIP‐EN‐10H dataset, a 56.7% reduction compared with conventional methods. Under 25 dB high‐noise conditions, the signal‐to‐noise ratio remains at 19 dB, with word recognition accuracy exceeding 90% for students and 80% for the general population. Users show a 30% improvement in expression proficiency, supporting the model's advantages in accuracy, noise resistance, and practical applicability. In summary, the model can objectively evaluate students' English pronunciation, making it a convenient aid for English learning.
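
The slope-constrained search described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a global parallelogram window (in the style of an Itakura parallelogram) with a hypothetical `max_slope` of 2, Euclidean distance between feature frames, and the plain three-neighbor DTW recurrence. The function names and the returned `visited` count are illustrative.

```python
import math


def in_parallelogram(i, j, n, m, max_slope=2.0):
    """True if cell (i, j) lies inside the parallelogram whose corners
    are the start (0, 0) and end (n-1, m-1) of the alignment grid."""
    min_slope = 1.0 / max_slope
    di, dj = (n - 1) - i, (m - 1) - j  # remaining steps to the end cell
    return (min_slope * i <= j <= max_slope * i
            and min_slope * di <= dj <= max_slope * di)


def constrained_dtw(ref, test, max_slope=2.0):
    """DTW distance between two sequences of feature vectors, evaluated
    only inside the parallelogram. Returns (distance, cells_visited);
    the distance is math.inf when no admissible path exists."""
    n, m = len(ref), len(test)
    cost = [[math.inf] * m for _ in range(n)]
    visited = 0
    for i in range(n):
        for j in range(m):
            if not in_parallelogram(i, j, n, m, max_slope):
                continue  # outside the search region: never evaluated
            visited += 1
            d = math.dist(ref[i], test[j])  # Euclidean frame distance
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = min(
                cost[i - 1][j] if i > 0 else math.inf,
                cost[i][j - 1] if j > 0 else math.inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            cost[i][j] = d + best
    return cost[n - 1][m - 1], visited


# Frames here are 1-D tuples; real frames would hold SMFCC, intensity,
# and fundamental-frequency values per frame (an assumption about layout).
template = [(0.0,), (1.0,), (2.0,), (3.0,)]
distance, visited = constrained_dtw(template, template)
print(distance, visited, len(template) ** 2)
```

Cells outside the parallelogram are skipped entirely, which is where the saving over unconstrained DTW comes from. The same bounds also reject implausible alignments: when one utterance is more than `max_slope` times longer than the other, no admissible path exists and the sketch returns infinity.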
