What type of study is this?

This is a Mixed-Methods study (also classified as: Quantitative Study).

October 10, 2025 | Open Access

What Numbers Cannot Fully Reveal: Correlating CAFP Measures With Holistic Human Ratings and Examiners’ Insights in an Opinion-Based Monologic English-Speaking Test

Key Points

Higher scores in speaking assessments are predicted by accuracy, intelligible pronunciation, and fluency, but not by syntactic complexity.
Quantitative analysis covered the performances of 76 L2 speakers; qualitative data showed that raters prioritize perceived communicative effectiveness over less salient features.
A mixed-methods approach was employed, combining quantitative measures and qualitative insights to clarify rating discrepancies.
Results suggest a need for refinements in rater training and rubric design to improve fairness and validity in language assessments.

Abstract

Despite ongoing efforts to standardize foreign language (L2) speaking assessment, the validity and reliability of human ratings remain contested due to their inherent subjectivity and the limited transparency of underlying judgment processes. While prior quantitative research has explored correlations between rater scores and measures of complexity, accuracy, fluency, and pronunciation (CAFP), relatively few studies have examined why particular linguistic features carry more weight than others. Addressing this gap, the present study employed a multilayered mixed-methods approach to investigate the relationship between CAFP indices and holistic ratings in an opinion-based, monologic English-speaking test at a Malaysian university. Quantitative analysis of 76 L2 speakers’ performances revealed that global accuracy, intelligible pronunciation, fluency (characterized by fewer pauses and repairs), and lexical sophistication (use of academic vocabulary) were strong predictors of higher scores, whereas syntactic complexity, lexical density, lexical diversity, and speech rate exerted no influence. Qualitative data from think-aloud protocols, interviews, and observation notes showed that raters tended to prioritize perceived communicative effectiveness and relied on salient features under cognitive load, often overlooking less prominent aspects. The findings underscore the need to refine rater training and rubric design to mitigate judgment bias and cognitive fatigue, thereby supporting fairness and validity in L2 speaking assessment. At a broader level, these results offer an evidence base to enhance rater calibration and scoring consistency in L2 speaking assessment worldwide.
