What question did this study set out to answer?

The aim is to develop and evaluate a performance-based rubric to assess generative AI literacy among medical students.

May 8, 2026 | Open Access

A Performance-Based Rubric for Generative AI use in Medical Students’ Research Tasks: Development and Initial Psychometric Evaluation

Key Points

The aim is to develop and evaluate a performance-based rubric to assess generative AI literacy among medical students.
Developed a four-domain rubric, with each domain scored 0–3, over three pilot–revise cycles; three educators independently scored all submissions to establish inter-rater reliability.
Analyzed 50 submissions from third-year medical students, each comprising a structured research proposal, the corresponding AI chat transcript, and an AI-use disclosure.
Evaluated the rubric using ICC for inter-rater reliability, Spearman correlations for inter-domain and GPA associations, and Friedman tests for differences across domains.
Mean total score was 4.06 (SD 1.80) out of 12, with substantial floor effects in AI Use Documentation (64% scored 0) and Verification (60% scored 0).
Inter-rater reliability was high for all domains (ICC 0.83 to 0.99) and highest for Documentation (ICC 0.99).
Total scores showed no significant correlation with overall GPA (r = 0.194, p = 0.201).
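
The scoring arithmetic behind these points can be illustrated with a minimal Python sketch. It assumes each submission is stored as a dictionary of integer domain scores from a single rater; the function names and the toy data are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: names and toy data are hypothetical, not the study's.
DOMAINS = (
    "AI Use Documentation",
    "Prompt Generation",
    "Verification",
    "Integration",
)


def total_score(domain_scores):
    """Sum four domain scores (each 0-3) into a total on the 0-12 scale."""
    for name in DOMAINS:
        if domain_scores[name] not in (0, 1, 2, 3):
            raise ValueError(f"{name} must be scored 0, 1, 2, or 3")
    return sum(domain_scores[name] for name in DOMAINS)


def floor_rate(submissions, domain):
    """Share of submissions that scored 0 in the given domain."""
    zeros = sum(1 for s in submissions if s[domain] == 0)
    return zeros / len(submissions)


# Toy data (assumed for illustration, not the study's submissions).
toy_submissions = [
    {"AI Use Documentation": 0, "Prompt Generation": 1, "Verification": 0, "Integration": 2},
    {"AI Use Documentation": 3, "Prompt Generation": 2, "Verification": 1, "Integration": 2},
    {"AI Use Documentation": 0, "Prompt Generation": 1, "Verification": 0, "Integration": 1},
    {"AI Use Documentation": 0, "Prompt Generation": 2, "Verification": 0, "Integration": 2},
]

for s in toy_submissions:
    print(total_score(s))
print(floor_rate(toy_submissions, "Verification"))
```

A floor rate near the reported 64% or 60% means most of a domain's scores sit at the bottom of the scale, which limits how well that domain can separate weaker from stronger submissions.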

Abstract

Background As generative AI becomes embedded in medical training, patient safety depends on graduates’ ability to recognize AI limitations and bias, document AI involvement transparently, and verify AI-generated information rather than accept it uncritically. We developed a performance-based rubric to assess observable generative AI (LLM) literacy behaviors within authentic coursework.
Methods In a single-institution evaluation (Spring 2025), third-year medical students (n = 50 submissions) completed a structured research proposal and submitted the corresponding AI chat transcript and an AI-use disclosure. A four-domain rubric was developed through three pilot–revise cycles: AI Use Documentation, Prompt Generation, Verification, and Integration. Each domain was scored 0–3 (total 0–12). Three educators independently scored all submissions. Inter-rater reliability was assessed using ICC (average-measures, agreement). Construct-relevant patterns were examined via domain distributions (floor effects), performance bands (lower 25%, middle 50%, upper 25%), within-submission differences across domains (Friedman with Bonferroni-adjusted Wilcoxon tests), inter-domain associations (Spearman), and correlation with overall GPA (Spearman).
Results Mean (SD) domain scores were: AI Use Documentation 0.67 (1.08), Prompt Generation 1.33 (0.69), Verification 0.41 (0.71), and Integration 1.64 (0.67); total score 4.06 (1.80). Floor effects were substantial for AI Use Documentation (64% scored 0) and Verification (60% scored 0). Inter-rater reliability was high (ICC: Documentation 0.99, Prompt Generation 0.84, Verification 0.93, Integration 0.83). Verification was significantly lower than Prompt Generation and Integration (Bonferroni-adjusted p < 0.008). Inter-domain correlations were weak (ρ = −0.206 to 0.310). Total scores showed no significant association with GPA (r = 0.194, p = 0.201).
Conclusions This rubric demonstrated strong scoring reliability and produced initial psychometric evidence consistent with measuring distinct, observable LLM-use competencies. Findings highlight prominent gaps in verification and transparent documentation, reinforcing competency guidance that emphasizes recognizing AI limitations and verifying AI output to protect patient safety. Further multi-site validation and implementation work is warranted.
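
To make the reliability statistic concrete, the following is a minimal Python sketch of an average-measures, absolute-agreement ICC computed from two-way ANOVA mean squares (the form often labeled ICC(A,k) or ICC(2,k)). The abstract specifies only "average-measures, agreement", so the exact model and software the authors used are assumptions here, and the function name and toy ratings are hypothetical.

```python
# Illustrative sketch only: not the authors' analysis code.
from statistics import mean


def icc_average_agreement(scores):
    """Average-measures, absolute-agreement ICC.

    `scores` is a list of rows, one row per submission and one column per
    rater. Assumes a complete matrix (every rater scored every submission)
    and some variation in the scores, otherwise the ratio is undefined.
    """
    n = len(scores)      # number of submissions
    k = len(scores[0])   # number of raters
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares: submissions, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Rater mean differences (ms_cols) count against agreement.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Toy ratings for one domain: 5 submissions x 3 raters (hypothetical values).
toy_scores = [
    [0, 0, 0],
    [1, 1, 2],
    [3, 3, 3],
    [0, 1, 0],
    [2, 2, 2],
]
print(round(icc_average_agreement(toy_scores), 3))
```

Because the rater mean square appears in the denominator, a rater who scores consistently higher or lower than the others reduces this coefficient, which is what distinguishes an agreement ICC from a consistency ICC.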
