6585 Background: Large language models (LLMs) are emerging as promising tools for clinical decision-making. These models are increasingly investigated for their potential in diagnosis, prognosis, and personalized treatment of chronic myeloid leukemia (CML). This study evaluates the efficacy of five leading LLMs and examines whether incorporating up-to-date clinical guidelines into LLM prompts improves clinical decision-making.
Methods: We developed fifty standardized clinical vignettes on CML, covering staging, tyrosine kinase inhibitor (TKI) selection, monitoring, adverse events, and treatment-free remission. Five LLMs (DeepSeek-R1, Gemini-2.5-Pro, Claude 4.0, GPT-OSS-120B, and LLAMA 4) generated responses under two distinct prompt conditions. In the baseline prompt, the clinical vignette was entered, and a response was obtained. In the guideline-augmented prompt, models were instructed to extract information from the NCCN v1.2026 and ESMO 2017 guidelines. A blinded expert panel evaluated the responses using a peer-reviewed answer key, assigning scores of 0 for incorrect, 1 for partial, and 2 for correct answers. The maximum possible score was 100.
Results: Guideline-augmented prompting increased accuracy across all models, raising mean performance from 79.8% to 97.6% (mean difference: 17.8 percentage points, 95% CI: 3.9-31.7, p=0.024). Improvements for individual models ranged from 5 to 33 percentage points. Notably, only Gemini-2.5-Pro achieved perfect accuracy (100%). Knowledge drift was observed in four out of five models (80%), primarily due to reliance on outdated WHO classification criteria instead of current NCCN standards. This discrepancy resulted in staging errors in 32% of baseline cases. The diagnostic workup category exhibited the highest baseline accuracy (94%), whereas the treatment-free remission criteria showed the most substantial improvement following guideline integration, increasing from 72% to 96%.
Conclusions: LLMs possess substantial medical knowledge that can facilitate clinical decision-making. However, these AI systems remain vulnerable to errors such as knowledge drift and hallucinations. This scalable framework demonstrates that mandatory compliance with guidelines, supported by structured prompting, is essential for safe LLM integration into oncology workflows. Institutions implementing LLM-based clinical decision support should require guideline integration rather than relying solely on pre-trained knowledge. Future studies should validate these findings in prospective clinical settings and evaluate mechanisms for automated guideline updating.
Comparative accuracy of LLM responses under baseline vs. guideline-augmented prompts.
LLM              Baseline accuracy (%)   Guideline-augmented accuracy (%)   Delta (percentage points)
Claude 4.0       82                      97                                 +15
DeepSeek-R1      72                      97                                 +25
Gemini-2.5-Pro   89                      100                                +11
GPT-OSS-120B     90                      95                                 +5
LLAMA 4          66                      99                                 +33
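The summary statistics in Results can be checked against the per-model scores in the table. The abstract does not name the statistical test used; the sketch below assumes a two-sided paired t-test over the five per-model deltas, which is only one plausible reading. The variable names, the hardcoded critical value, and the closed-form CDF helper are illustrative choices, not part of the study.

```python
import math
import statistics

# Per-model scores (out of 100) copied from the abstract's table:
# (baseline, guideline-augmented).
scores = {
    "Claude 4.0": (82, 97),
    "DeepSeek-R1": (72, 97),
    "Gemini-2.5-Pro": (89, 100),
    "GPT-OSS-120B": (90, 95),
    "LLAMA 4": (66, 99),
}

# Paired differences, one per model.
deltas = [augmented - baseline for baseline, augmented in scores.values()]
n = len(deltas)

mean_delta = statistics.mean(deltas)
std_err = statistics.stdev(deltas) / math.sqrt(n)  # sample SD / sqrt(n)
t_stat = mean_delta / std_err

# ASSUMPTION: paired t-test with n - 1 = 4 degrees of freedom.
# 2.776 is the standard two-sided 95% critical value of Student's t at 4 df.
T_CRIT_DF4 = 2.776
ci_low = mean_delta - T_CRIT_DF4 * std_err
ci_high = mean_delta + T_CRIT_DF4 * std_err


def t_cdf_df4(t: float) -> float:
    """Closed-form CDF of Student's t distribution with exactly 4 df."""
    u = t / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.375 * u * (1.0 - u * u / 12.0)


p_value = 2.0 * (1.0 - t_cdf_df4(abs(t_stat)))

print(f"mean delta = {mean_delta:.1f} percentage points")
print(f"95% CI     = {ci_low:.1f} to {ci_high:.1f}")
print(f"p-value    = {p_value:.3f}")
```

Under this assumption the script yields a mean difference of 17.8 percentage points, a 95% CI of 3.9 to 31.7, and p = 0.024, matching the figures reported in Results. With only five paired observations the interval is wide, so the estimate of the average gain is imprecise even though every model improved.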
Gehlawat et al. studied this question.