Mispronunciation detection (MD) is a key component in computer-assisted pronunciation training (CAPT) and speaking tests. Most MD systems adopt a production view, measuring phone-level deviation from a canonical pronunciation (Native Norm) or the expected pronunciation of a target population (Target Norm). Yet, pronunciation assessment is fundamentally perceptual: listeners map speech to linguistic categories under uncertainty and with individual psychological priors, so judgments are inherently subjective and lack a single gold standard. Labels are therefore often aggregated (e.g., voting), but aggregation rules are themselves subjective, require many annotators, and entangle individual perception with social consensus, complicating model training. In this paper, we propose a “Perception Norm”, which models MD as the decision process of individual annotators and trains models to simulate single listeners rather than an annotator pool. To support this study, we introduce UY/CH-CHILD-MA, a corpus of Uyghur-accented child Mandarin words and phrases with four independent phone-level annotations. Our experiments reveal substantial inter-annotator variation and show that a Transformer with pre-training and fine-tuning can learn annotator-specific patterns with high accuracy. Finally, we present a committee ensemble that combines annotator models using application-matched aggregation rules to produce task-specific assessments. The data and source code will be made publicly available upon publication.
Nijat et al. (Sun,) studied this question.
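The committee ensemble mentioned in the abstract can be illustrated with a minimal sketch. It assumes each annotator-specific model emits a binary decision per phone (1 = mispronounced, 0 = accepted). The rule names (`any`, `majority`, `unanimous`), the function `committee_decision`, and the example votes are hypothetical and chosen for illustration only; the paper's actual aggregation rules may differ.

```python
from typing import Callable, Dict, List

# Assumption: a rule maps the committee size n to the minimum number of
# annotator models that must flag a phone before the committee flags it.
RULES: Dict[str, Callable[[int], int]] = {
    "any": lambda n: 1,                # strict: one flag is enough
    "majority": lambda n: n // 2 + 1,  # more than half must flag
    "unanimous": lambda n: n,          # lenient: all must flag
}


def committee_decision(per_annotator: Dict[str, List[int]], rule: str) -> List[int]:
    """Combine per-annotator phone-level decisions into one decision per phone.

    per_annotator maps a (hypothetical) annotator-model name to a list of
    0/1 decisions, one per phone, all of the same length.
    """
    if not per_annotator:
        raise ValueError("at least one annotator model is required")
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule!r}")
    lengths = {len(v) for v in per_annotator.values()}
    if len(lengths) != 1:
        raise ValueError("all annotator models must score the same phones")

    threshold = RULES[rule](len(per_annotator))
    # zip(*...) regroups the votes phone by phone across annotators.
    return [int(sum(votes) >= threshold) for votes in zip(*per_annotator.values())]


# Illustrative data only: four annotator models, five phones.
example_votes = {
    "annotator_1": [1, 0, 1, 0, 1],
    "annotator_2": [1, 0, 0, 0, 1],
    "annotator_3": [1, 1, 0, 0, 1],
    "annotator_4": [0, 0, 0, 0, 1],
}

if __name__ == "__main__":
    for name in RULES:
        print(name, committee_decision(example_votes, name))
```

Under these assumptions, the same four annotator models yield different assessments depending on the rule selected, which is one way to read "application-matched": an application that should flag anything questionable could use a strict rule such as `any`, while one that should flag only clear errors could require `unanimous` agreement.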