Introduction and Objective: Current medical AI benchmarks rely on single-best-answer exam accuracy (e.g., USMLE), but real-world type 2 diabetes (T2D) care involves context-dependent clinical decisions with acceptable practice variability. We aimed to develop a diabetologist-validated framework for T2D management that discriminates the real-world clinical effectiveness of AI systems.

Methods: We devised a Donabedian model-based framework to assess AI clinical decision capability by evaluating clinical reasoning across five phases: patient triage/problem list, medication recommendation, treatment strategy, dose adjustment, and monitoring/education. Meta-evaluation items embedded at the end of each phase assessed the framework’s ability to discriminate the clinical effectiveness of AI systems. Reviewers rated comprehensiveness (coverage of required elements in T2D care) and clarity (unambiguous interpretation and application) on a 4-point scale, and provided free-text feedback to inform between-round revisions. Twelve diabetologists completed two initial Delphi rounds; three senior diabetologists led the final consensus review.

Results: Delphi rounds 1-2 generated 102 item-level revision comments spanning validity, clarity, coverage, feasibility, and traceability. Iterative revisions streamlined the framework from 56 to 29 evaluation items by removing redundancy and sharpening workflow-aligned criteria, while the content validity index increased from 64.4%/51.1% (comprehensiveness/clarity) in the initial round to 100%/100% in the final round.

Conclusion: This diabetologist consensus-validated framework provides explicit standards to systematically assess AI-generated T2D treatment recommendations across reasoning reliability, clinical utility, and real-world feasibility. The framework has the potential to serve as an evaluative benchmark for distinguishing AI systems that effectively support diabetologists' treatment decision-making.

Disclosure: S. Baek: None. J. Kim: None. S. Jin: None. G. Kim: None. Y. Lee: None. J. Kim: None. S. Cho: None. R. Oh: None. B. Kim: None. M. Jang: None. S. Ko: None. M. Moon: None. K. Kim: None. K. Hur: None.

Funding: Future Medicine 2030 Project of the Samsung Medical Center (#SMX1250111); the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00357879)
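
The abstract reports a content validity index derived from 4-point ratings but does not state how it was computed. The sketch below assumes one common convention: an item-level index equal to the share of reviewers rating an item 3 or 4, averaged across items for a scale-level figure. The function names, the threshold of 3, and all ratings shown are hypothetical illustrations, not the study's method or data.

```python
from statistics import mean


def item_cvi(ratings, threshold=3):
    """Share of reviewers who rated the item at or above `threshold`.

    Assumption: ratings of 3 or 4 on the 4-point scale count as endorsement.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(r >= threshold for r in ratings) / len(ratings)


def scale_cvi_average(items):
    """Mean of the item-level indices (an averaging convention, assumed here)."""
    return mean(item_cvi(r) for r in items.values())


# Hypothetical ratings from 12 reviewers for three made-up items.
example_ratings = {
    "item_a": [4, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    "item_b": [4, 3, 2, 4, 3, 2, 4, 3, 4, 2, 3, 4],
    "item_c": [2, 3, 2, 1, 3, 2, 4, 2, 3, 2, 1, 3],
}

if __name__ == "__main__":
    for name, ratings in example_ratings.items():
        print(f"{name}: {item_cvi(ratings):.3f}")
    print(f"scale average: {scale_cvi_average(example_ratings):.3f}")
```

Under this convention, an item reaches 100% only when every reviewer rates it 3 or 4. Other conventions exist (for example, the share of items with universal agreement), and the abstract does not say which one the authors used.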