Strong business skills, such as communication, professional judgment, and stakeholder management, have become a key differentiator for actuarial trainees entering the workplace and are correlated with future success. While case studies have historically proven effective at developing these skills, existing resources are limited and are typically structured as multi-week team projects that are difficult to scale, individualize, or align with specific competencies. To address this gap, this paper examines whether AI models can (i) efficiently transform a small set of comprehensive actuarial cases into many brief, single-competency, individual assessments; and (ii) score these assessments with adequate psychometric quality. Using 144 AI-generated assessments covering the Society of Actuaries’ eight core competencies, we achieve strong reliability (G = 0.719 and 0.740) with optimized three- and four-grader panels, respectively, selected through Generalizability Theory analysis. Our experiments reveal that iterative prompt refinement improves assessment quality: later prompts outperform initial versions, with a medium-sized effect. However, we document critical challenges: all AI graders exhibit in-group bias, systematically favoring assessments generated by their own model family despite anonymization. Additionally, graders may engage in algorithmic gaming, producing low-entropy scoring patterns with strong halo effects that bear no relationship to actual assessment quality. The exclusion of unreliable graders from a model family partially explains the apparent underperformance of assessments from that same family, illustrating how grader selection can inadvertently create bias. We propose a hybrid approach combining carefully selected AI grader panels with human moderators to address these documented biases while leveraging the efficiency gains of automated assessment.
Orfanos et al. (Mon,) studied this question.
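
To make the reliability figures above concrete, the following is a minimal sketch of how a relative G coefficient can be estimated and projected to different panel sizes. It assumes the simplest Generalizability Theory design, in which every grader scores every assessment (a fully crossed one-facet design). The paper's actual design may include further facets, and the function name `g_coefficient` and the sample scores are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def g_coefficient(scores, n_graders=None):
    """Relative G coefficient for a fully crossed assessments x graders design.

    scores: list of rows, one row per assessment, one column per grader.
    n_graders: panel size to project to (defaults to the observed number).
    Assumes at least two assessments, two graders, and some score variation.
    """
    n_p = len(scores)        # number of assessments (objects of measurement)
    n_r = len(scores[0])     # number of graders actually observed
    if n_graders is None:
        n_graders = n_r

    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(n_r)]

    # Mean square for assessments (two-way ANOVA without replication).
    ms_p = n_r * sum((m - grand) ** 2 for m in row_means) / (n_p - 1)

    # Residual mean square: assessment-by-grader interaction plus error.
    ss_res = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n_p)
        for j in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Estimated variance components; negative estimates are truncated at zero.
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_res = ms_res

    # Relative error variance shrinks as the panel grows.
    return var_p / (var_p + var_res / n_graders)


# Hypothetical scores: 4 assessments (rows) rated by 3 graders (columns).
example_scores = [
    [3, 4, 2],
    [5, 4, 5],
    [2, 2, 3],
    [4, 5, 5],
]

for panel_size in (1, 3, 4):
    print(panel_size, round(g_coefficient(example_scores, panel_size), 3))
```

Projecting the same variance components to larger panels shows why adding a fourth grader raises G only modestly: the residual variance is divided by the panel size, so each additional grader contributes less than the one before.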