What question did this study set out to answer?

The aim is to develop a perception norm that models mispronunciation detection based on individual annotator decisions rather than a broader consensus.

April 1, 2026 | Open Access

Perception Norm for Mispronunciation Detection

Key Points

Proposed a Perception Norm that models mispronunciation detection as the decisions of individual annotators rather than a pooled consensus.
Introduced a new corpus, UY/CH-CHILD-MA, of Uyghur-accented child Mandarin words and phrases with four independent phone-level annotations.
Analyzed inter-annotator variation in mispronunciation judgments.
Utilized a Transformer model with pre-training and fine-tuning to capture annotator-specific patterns.
Developed a committee ensemble model using application-matched aggregation rules.
Demonstrated substantial inter-annotator variation in mispronunciation detection.
Achieved high accuracy in learning annotator-specific decision patterns using the Transformer model.
Presented a task-specific assessment method through aggregated annotator models.
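
Inter-annotator variation of the kind listed above is often quantified with pairwise agreement statistics. The summary does not say which measure the authors used, so the sketch below assumes Cohen's kappa on binary phone-level labels (1 = mispronounced, 0 = acceptable). The function names and the toy labels are hypothetical and are not taken from the paper or the corpus.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappa(annotations):
    """Kappa for every annotator pair; `annotations` maps name -> label list."""
    return {
        (i, j): cohen_kappa(annotations[i], annotations[j])
        for i, j in combinations(sorted(annotations), 2)
    }


# Hypothetical toy labels for four phones from two annotators.
toy = {"A1": [1, 1, 0, 0], "A2": [1, 0, 0, 0]}
scores = pairwise_kappa(toy)
```

With four annotators, as in the corpus described here, this would produce six pairwise scores; low values would indicate that annotators apply different perceptual criteria.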

Abstract

Mispronunciation detection (MD) is a key component in computer-assisted pronunciation training (CAPT) and speaking tests. Most MD systems adopt a production view, measuring phone-level deviation from a canonical pronunciation (Native Norm) or the expected pronunciation of a target population (Target Norm). Yet, pronunciation assessment is fundamentally perceptual: listeners map speech to linguistic categories under uncertainty and with individual psychological priors, so judgments are inherently subjective and lack a single gold standard. Labels are therefore often aggregated (e.g., voting), but aggregation rules are themselves subjective, require many annotators, and entangle individual perception with social consensus, complicating model training. In this paper, we propose a “Perception Norm”, which models MD as the decision process of individual annotators and trains models to simulate single listeners rather than an annotator pool. To support this study, we introduce UY/CH-CHILD-MA, a corpus of Uyghur-accented child Mandarin words and phrases with four independent phone-level annotations. Our experiments reveal substantial inter-annotator variation and show that a Transformer with pre-training and fine-tuning can learn annotator-specific patterns with high accuracy. Finally, we present a committee ensemble that combines annotator models using application-matched aggregation rules to produce task-specific assessments. The data and source code will be made publicly available upon publication.
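
The committee idea in the abstract can be illustrated with a small sketch. It assumes each annotator-specific model emits a binary decision per phone (1 = mispronounced) and that the application picks how strict the combination should be. The rule names `any`, `majority`, and `all` are illustrative assumptions; the abstract does not specify which aggregation rules the authors use.

```python
def committee_decision(votes, rule="majority"):
    """Combine per-annotator-model binary votes for one phone.

    votes: list of 0/1 predictions, one per annotator model.
    rule:  hypothetical aggregation rule chosen to match the application.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    flagged = sum(votes)
    if rule == "any":
        # Strict assessment: flag if any simulated listener hears an error.
        return int(flagged >= 1)
    if rule == "majority":
        # Balanced assessment: flag only if more than half agree.
        return int(flagged > len(votes) / 2)
    if rule == "all":
        # Lenient assessment: flag only if every simulated listener agrees.
        return int(flagged == len(votes))
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical votes from four annotator models for a single phone.
votes = [1, 0, 0, 1]
strict = committee_decision(votes, "any")
balanced = committee_decision(votes, "majority")
lenient = committee_decision(votes, "all")
```

Under these assumptions, a formal speaking test might prefer the strict rule, while a practice tool for children might prefer the lenient one so that learners are corrected only on errors every listener would notice.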

Cite This Study

Nijat et al. (Sun,) studied this question.

synapsesocial.com/papers/69ccb72e16edfba7beb89151 https://doi.org/10.3390/app16073311
