What question did this study set out to answer?

This study assessed how well large language models score employment interviews compared with human raters and supervised machine learning models.

June 21, 2026 | Open Access

Scoring employment interviews with large language models: Evaluation design components, validity investigations, and best practice recommendations.

Key Points

Evaluated intrarater reliability, test-retest correlations, and several forms of validity evidence (convergent, discriminant, and criterion), along with group differences and measurement bias.
Compared large language model scoring to human raters and supervised machine learning models.
Provided best practice recommendations for organizations adopting large language models.
Ensembles of larger, newer large language models showed psychometric properties comparable to or superior to those of supervised machine learning models and single human raters (p < 0.05).
Reliability and validity evidence was most favorable when prompts included detailed construct information.
Organizations should remain cautious about using large language models to score high-stakes open-ended assessments, despite the promising results.

Abstract

= 144). We then investigated the LLM scores' intrarater reliabilities, test-retest correlations, convergent, discriminant, and criterion evidence of validity, group differences, and measurement bias. We compared this evidence, when possible, to the same evidence for human raters and supervised machine learning models. The results suggest that ensembles of larger, newer LLMs using prompts with detailed construct information hold potential for scoring employment interviews with psychometric properties comparable to or superior to supervised machine learning models and single human raters. We detail the reasons that organizations may want to be cautious in adopting LLMs for scoring high-stakes open-ended assessments, but since organizations have already begun adopting them, we also offer best practice recommendations. (PsycInfo Database Record (c) 2026 APA, all rights reserved).
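As a rough illustration of two ideas named in the abstract, the sketch below averages scores across several models (an "ensemble") and then correlates the ensemble with itself across two scoring runs and with a human rater. It is not the authors' procedure: the function names, the 1-5 rating scale, and every score are hypothetical, and Pearson correlation is assumed here as the agreement statistic.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ensemble(score_lists):
    """Average the scores each model gave to the same response."""
    return [mean(scores) for scores in zip(*score_lists)]


# Hypothetical 1-5 ratings of five interview responses by three models,
# collected in two separate scoring runs (made-up numbers, not study data).
run1 = [[3, 4, 2, 5, 1], [3, 5, 2, 4, 1], [4, 4, 3, 5, 2]]
run2 = [[3, 4, 2, 5, 2], [2, 5, 2, 4, 1], [4, 4, 3, 5, 1]]
human = [3, 5, 2, 5, 1]  # hypothetical single human rater

ens1 = ensemble(run1)
ens2 = ensemble(run2)

# Agreement of the ensemble with itself across the two runs.
rescore_r = pearson(ens1, ens2)
# Convergent evidence: agreement between ensemble and human scores.
convergent_r = pearson(ens1, human)

print(f"run-to-run r = {rescore_r:.2f}, ensemble-human r = {convergent_r:.2f}")
```

Averaging across models before correlating is one simple way to form an ensemble; weighting schemes or other agreement statistics (for example, intraclass correlations) would be equally reasonable choices.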
