Despite ongoing efforts to standardize foreign language (L2) speaking assessment, the validity and reliability of human ratings remain contested because of their inherent subjectivity and the limited transparency of the underlying judgment processes. While prior quantitative research has explored correlations between rater scores and measures of complexity, accuracy, fluency, and pronunciation (CAFP), relatively few studies have examined why particular linguistic features carry more weight than others. Addressing this gap, the present study employed a multilayered mixed-methods approach to investigate the relationship between CAFP indices and holistic ratings in an opinion-based, monologic English speaking test at a Malaysian university. Quantitative analysis of 76 L2 speakers’ performances revealed that global accuracy, intelligible pronunciation, fluency (characterized by fewer pauses and repairs), and lexical sophistication (use of academic vocabulary) were strong predictors of higher scores, whereas syntactic complexity, lexical density, lexical diversity, and speech rate exerted no influence. Qualitative data from think-aloud protocols, interviews, and observation notes showed that raters tended to prioritize perceived communicative effectiveness and, under cognitive load, relied on salient features while often overlooking less prominent ones. The findings underscore the need to refine rater training and rubric design to mitigate judgment bias and cognitive fatigue, thereby supporting fairness and validity in L2 speaking assessment. More broadly, these results offer an evidence base for improving rater calibration and scoring consistency in L2 speaking assessment worldwide.
Hu et al. studied this question.