e13647

Background: Multidisciplinary reviews (MDR) can alter the management of cancer cases. We previously presented more than 400 real-world cases across 5 cancer types to an expert MDR panel comprising radiology, medical, surgical, radiation, and hematologic oncology. Here, we evaluate the alignment and competence of recommendations generated by 3 leading AI models relative to those made by an MDR panel.

Methods: From a larger cancer database, we reviewed 261 complex cases of breast, lung, hematologic (heme), gastrointestinal (GI), and genitourinary (GU) cancer that had been adjudicated by an MDR panel between 2020 and 2021. Cases were analyzed by 3 AI models (OpenAI’s ChatGPT 4.5, Anthropic’s Claude Opus 4, and Google’s Gemini Ultra) using PrecisCa’s proprietary prompting method. Individual AI-generated recommendations from each model were scored on a scale of 1-5 (5 highest) across 6 domains: completeness, reasoning, clarity, menu of options, recency, and relevance versus the MDR panel recommendations. The maximum achievable score was 30 per case, yielding a maximum achievable aggregate score of 7,830 per model (261 cases × 30). Final AI recommendations were also compared with National Comprehensive Cancer Network (NCCN) guidelines for discrepancies. Reverse comparisons of additional AI-recommended options not identified by the MDR panel were not performed owing to interval updates in the past 5 years.

Results: Across all cancer types (Table 1), AI systems excelled in recency but not in completeness. Although the 3 AI models varied, alignment with MDR expert recommendations was high. Discordant cases reflected minor differences in option selection and were unlikely to have resulted in clinically meaningful changes in management.

Conclusions: This study demonstrates a high degree of alignment between recommendations generated by 3 leading AI models and those of an MDR panel across multiple complex cancer cases.
These findings support the potential role of AI as a clinical decision support tool when used in conjunction with human expert review, rather than as a replacement for multidisciplinary care.

Table 1. Characteristics and aggregate/median competence score (range) by cancer type.

| Cancer type | n | Histology (%) | ChatGPT 4.5 | Claude Opus 4 | Gemini Ultra |
|---|---|---|---|---|---|
| Breast | 70 | Ductal 90; Lobular 10 | 1868/25.5 (21-30) | 1940/23.5 (17-30) | 1965/25.5 (21-30) |
| Lung | 70 | Non-small cell 92.9; Small cell 7.1 | 1860/25 (20-30) | 1942/25 (20-30) | 1971/26 (22-30) |
| Heme | 38 | Hodgkin lymphoma 13.2; Leukemia 10.5; Multiple myeloma 36.8; Non-Hodgkin lymphoma 39.5 | 849/22.5 (15-30) | 932/22.5 (15-30) | 964/22.5 (15-30) |
| GI | 48 | Anal 6.25; Colorectal 43.8; Esophageal 12.5; Gastric 6.25; Hepatobiliary 10.4; Pancreatic 20.8 | 1231/20.5 (11-30) | 1249/20.5 (11-30) | 1264/21.5 (13-30) |
| GU | 35 | Bladder 20; Kidney 31.4; Prostate 42.9; Testicular 5.7 | 880/23.5 (17-30) | 931/23.5 (17-30) | 889/24 (18-30) |
| Total | 261 | | 6688/20.5 (11-30) | 6994/20.5 (11-30) | 7053/21.5 (13-30) |
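The scoring arithmetic described in the Methods can be checked with a short sketch. It assumes only the per-type case counts and aggregate scores transcribed from the table, plus the stated maximum of 30 points per case. The share-of-maximum figure it computes is a derived illustration, not a metric reported in the abstract, and the names `max_aggregate`, `model_total`, and `share_of_max` are hypothetical.

```python
# Case counts and aggregate scores transcribed from the table above.
CASES = {"Breast": 70, "Lung": 70, "Heme": 38, "GI": 48, "GU": 35}

AGGREGATE = {
    "ChatGPT 4.5": {"Breast": 1868, "Lung": 1860, "Heme": 849, "GI": 1231, "GU": 880},
    "Claude Opus 4": {"Breast": 1940, "Lung": 1942, "Heme": 932, "GI": 1249, "GU": 931},
    "Gemini Ultra": {"Breast": 1965, "Lung": 1971, "Heme": 964, "GI": 1264, "GU": 889},
}

MAX_PER_CASE = 30  # 6 domains, each scored 1-5


def max_aggregate(cases):
    """Maximum achievable aggregate score for one model over all cases."""
    return sum(cases.values()) * MAX_PER_CASE


def model_total(model):
    """Sum of a model's per-cancer-type aggregate scores."""
    return sum(AGGREGATE[model].values())


def share_of_max(model):
    """Fraction of the maximum aggregate score reached by a model.

    Illustrative only: this ratio is derived here, not reported in the abstract.
    """
    return model_total(model) / max_aggregate(CASES)


if __name__ == "__main__":
    print("Maximum aggregate per model:", max_aggregate(CASES))
    for name in AGGREGATE:
        print(f"{name}: {model_total(name)} ({share_of_max(name):.1%} of maximum)")
```

Summing the per-type rows this way also serves as a consistency check on the Total row of the table.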
Jahanzeb et al. studied this question.