What question did this study set out to answer?

The study aims to improve pronunciation assessment at the phoneme level for ESL learners using a new model that provides detailed feedback.

March 7, 2026 | Open Access

An Articulatory-Aware CNN-BiGRU-Attention Framework for Explainable Phoneme-Level Pronunciation Assessment in ESL Speech

Key Points

Developed a hybrid CNN-BiGRU-Attention architecture for phoneme recognition.
Implemented an Articulatory Error Mapping Engine to classify articulation errors.
Evaluated the framework on non-native English speech, using Python deep learning packages and speech processing toolkits.
Achieved a phoneme recognition accuracy of 91.4%, outperforming commercial ASR systems and traditional HMM-GMM models.
Demonstrated high sensitivity to ESL pronunciation errors, with detection accuracies of 84% for substitutions, 82% for deletions, and 79% for insertions.
Achieved over 87% accuracy in articulatory mapping across all error categories.
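
The three error types named above (substitution, deletion, insertion) are commonly identified by aligning the phoneme sequence a learner was expected to produce against the sequence a recognizer actually output. The block below is a minimal sketch of that idea using a standard edit-distance alignment. It is not the paper's method: the function name `align_phonemes`, the ARPAbet-style phoneme labels, and the unit edit costs are all assumptions made for illustration.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (operation, expected phoneme, produced phoneme)


def align_phonemes(canonical: List[str], recognized: List[str]) -> List[Op]:
    """Align two phoneme sequences and label each step.

    Hypothetical helper: uses plain Levenshtein alignment with unit costs.
    """
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = int(canonical[i - 1] != recognized[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + diff,  # match or substitution
                cost[i - 1][j] + 1,         # deletion (expected phoneme missing)
                cost[i][j - 1] + 1,         # insertion (extra phoneme produced)
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = int(canonical[i - 1] != recognized[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                label = "substitution" if diff else "match"
                ops.append((label, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], ""))
            i -= 1
        else:
            ops.append(("insertion", "", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # "think" produced with /s/ in place of the dental fricative
    for op in align_phonemes(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]):
        print(op)
```

In this sketch, every non-match entry in the returned list is a candidate pronunciation error that a later stage could explain in articulatory terms.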

Abstract

Accurate pronunciation at the phoneme level is one of the most persistent difficulties for learners of English as a Second Language (ESL), because even slight deviations in pronunciation can strongly affect communicative effectiveness and intelligibility. Existing pronunciation assessment methods, most of which rely on automatic speech recognition (ASR), report results at the word or sentence level and return generic numerical scores with little linguistic meaning, which makes them ineffective for evaluating accented speech and guiding correction. To overcome these shortcomings, the paper introduces an articulatory-aware phoneme recognition framework that provides fine-grained, interpretable feedback for improving ESL pronunciation. The novelty of the work lies in combining a hybrid CNN-BiGRU-Attention architecture with an Articulatory Error Mapping Engine, which symbolically maps phoneme-level errors to articulatory deviations in place of articulation, manner of articulation, voicing, and vowel quality. In experiments on non-native English speech, the framework achieved a phoneme recognition accuracy of 91.4%, well above commercial ASR-based systems (78.3%) and traditional HMM-GMM baselines (70.5%). The system was also sensitive to ESL pronunciation errors, detecting substitutions with 84% accuracy, deletions with 82%, and insertions with 79%, while articulatory mapping accuracy exceeded 87% in all error categories. The framework was implemented and tested in Python with deep learning packages and speech processing toolkits. It offers a scalable, explainable, learner-focused system that can support intelligent ESL pronunciation training with pedagogically meaningful feedback at the phoneme level.
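
To make the idea of mapping a phoneme error to an articulatory deviation concrete, the sketch below compares the expected and produced phonemes along three of the dimensions the abstract names: place, manner, and voicing. It is an illustration under stated assumptions, not the paper's Articulatory Error Mapping Engine: the table `FEATURES`, the function names, and the feedback wording are hypothetical, the table covers only a handful of English consonants (ARPAbet-style labels), and vowel quality is omitted.

```python
from typing import Dict, Tuple

# Hypothetical feature table: standard phonetic descriptions of a few consonants.
FEATURES: Dict[str, Dict[str, str]] = {
    "P":  {"place": "bilabial",    "manner": "stop",      "voicing": "voiceless"},
    "B":  {"place": "bilabial",    "manner": "stop",      "voicing": "voiced"},
    "T":  {"place": "alveolar",    "manner": "stop",      "voicing": "voiceless"},
    "D":  {"place": "alveolar",    "manner": "stop",      "voicing": "voiced"},
    "F":  {"place": "labiodental", "manner": "fricative", "voicing": "voiceless"},
    "V":  {"place": "labiodental", "manner": "fricative", "voicing": "voiced"},
    "TH": {"place": "dental",      "manner": "fricative", "voicing": "voiceless"},
    "DH": {"place": "dental",      "manner": "fricative", "voicing": "voiced"},
    "S":  {"place": "alveolar",    "manner": "fricative", "voicing": "voiceless"},
    "Z":  {"place": "alveolar",    "manner": "fricative", "voicing": "voiced"},
}


def articulatory_deviations(expected: str, produced: str) -> Dict[str, Tuple[str, str]]:
    """Return {dimension: (expected value, produced value)} where the two differ."""
    exp, prod = FEATURES[expected], FEATURES[produced]
    return {dim: (exp[dim], prod[dim]) for dim in exp if exp[dim] != prod[dim]}


def feedback(expected: str, produced: str) -> str:
    """Turn the deviations into a short learner-facing message (illustrative wording)."""
    deviations = articulatory_deviations(expected, produced)
    if not deviations:
        return f"/{expected}/ matched on place, manner, and voicing."
    parts = [f"{dim}: expected {want}, produced {got}"
             for dim, (want, got) in deviations.items()]
    return f"/{expected}/ was realized as /{produced}/ ({'; '.join(parts)})."


if __name__ == "__main__":
    print(feedback("TH", "S"))  # place differs: dental vs. alveolar
    print(feedback("V", "B"))   # place and manner differ
    print(feedback("Z", "S"))   # voicing differs
```

A symbolic lookup like this keeps the feedback interpretable, since every message traces back to a named articulatory dimension and not to an opaque score.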

Cite This Study

Bindhu et al. studied this question.

synapsesocial.com/papers/69abc1c65af8044f7a4eab55 https://doi.org/10.14569/ijacsa.2026.0170261
