Language-learning applications frequently use automatic pronunciation assessment models. Automatic fluency assessment of spontaneous speech in the absence of reference material is an important task that relies heavily on automatic speech recognition (ASR). This research implements an approach that combines prosodic, completeness, and fluency scores to get around these limitations. The assessment is performed by dynamic time warping (DTW) matching of the pitch contours of a weighted average of the context tokens present in the audio file, which is rich in mispronounced phonemes. The approach is validated on the speechocean762 dataset. The implemented model achieves a Corre of 0.980, an MSE of 0.072, a PCC of 0.753, a rounded PCC of 0.6534, and a rounded MSE of 0.1122. The model is compared with existing methods such as the multimodal automatic speech fluency assessment model and the end-to-end (E2E-R) method.
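The DTW matching of pitch contours mentioned above can be illustrated with a minimal sketch. It assumes the classic DTW recurrence with an absolute-difference local cost. The abstract does not state the cost function, normalization, or how the weighted average of context tokens is formed, so the function name `dtw_distance` and the pitch values below are hypothetical and not the authors' implementation.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference between
    pitch values; the paper may use a different cost or normalization.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Hypothetical pitch contours in Hz: the second holds one value longer.
contour_a = [110.0, 120.0, 130.0]
contour_b = [110.0, 120.0, 120.0, 130.0]
distance = dtw_distance(contour_a, contour_b)
```

Because DTW allows one contour to stretch in time relative to the other, two contours with the same shape but different speaking rates align at low cost, which is what makes it a natural fit for comparing prosody across utterances of different durations.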
Bo Xu (Fri,) studied this question.