Background As generative AI becomes embedded in medical training, patient safety depends on graduates’ ability to recognize AI limitations and bias, document AI involvement transparently, and verify AI-generated information rather than accept it uncritically. We developed a performance-based rubric to assess observable generative AI (LLM) literacy behaviors within authentic coursework.
Methods In a single-institution evaluation (Spring 2025), third-year medical students (n = 50 submissions) completed a structured research proposal and submitted the corresponding AI chat transcript and an AI-use disclosure. A four-domain rubric was developed through three pilot–revise cycles: AI Use Documentation, Prompt Generation, Verification, and Integration. Each domain was scored 0–3 (total 0–12). Three educators independently scored all submissions. Inter-rater reliability was assessed using ICC (average-measures, agreement). Construct-relevant patterns were examined via domain distributions (floor effects), performance bands (lower 25%, middle 50%, upper 25%), within-submission differences across domains (Friedman with Bonferroni-adjusted Wilcoxon tests), inter-domain associations (Spearman), and correlation with overall GPA (Spearman).
Results Mean (SD) domain scores were: AI Use Documentation 0.67 (1.08), Prompt Generation 1.33 (0.69), Verification 0.41 (0.71), and Integration 1.64 (0.67); total score 4.06 (1.80). Floor effects were substantial for AI Use Documentation (64% scored 0) and Verification (60% scored 0). Inter-rater reliability was high (ICC: Documentation 0.99, Prompt Generation 0.84, Verification 0.93, Integration 0.83). Verification was significantly lower than Prompt Generation and Integration (Bonferroni-adjusted p < 0.008). Inter-domain correlations were weak (ρ = −0.206 to 0.310). Total scores showed no significant association with GPA (r = 0.194, p = 0.201).
Conclusions This rubric demonstrated strong scoring reliability and produced initial psychometric evidence consistent with measuring distinct, observable LLM-use competencies. Findings highlight prominent gaps in verification and transparent documentation, reinforcing competency guidance that emphasizes recognizing AI limitations and verifying AI output to protect patient safety. Further multi-site validation and implementation work is warranted.
Shiukashvili et al. studied this question.
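Three of the descriptive quantities named in the Methods (floor effects, the Bonferroni adjustment for pairwise domain comparisons, and Spearman correlation) can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names and the example scores are hypothetical, and the assumption that the adjustment divides 0.05 by the six pairwise comparisons among four domains is ours; the abstract does not state how the threshold was derived, although 0.05 / 6 ≈ 0.008 is consistent with the reported value.

```python
from itertools import combinations
from math import sqrt


def floor_proportion(scores, floor=0):
    """Share of submissions scoring at the scale minimum (floor effect)."""
    return sum(1 for s in scores if s == floor) / len(scores)


def bonferroni_alpha(n_domains, alpha=0.05):
    """Per-comparison threshold when every pair of domains is tested.

    Assumption: all pairwise comparisons are performed.
    """
    n_pairs = len(list(combinations(range(n_domains), 2)))
    return alpha / n_pairs


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of tie-averaged ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical domain scores on the 0-3 scale (not study data).
verification = [0, 0, 0, 1, 2, 0, 1, 0]
integration = [1, 2, 2, 1, 3, 1, 2, 2]

print(floor_proportion(verification))      # 0.625
print(round(bonferroni_alpha(4), 4))       # 0.0083
print(spearman_rho(verification, integration))
```

Average ranks matter here because a 0–3 ordinal scale produces many ties. The Friedman and Wilcoxon tests themselves are left out and would normally come from a statistics library rather than be hand-written.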