What question did this study set out to answer?

This study aims to evaluate the efficacy of five leading large language models in clinical decision-making for chronic myeloid leukemia.

May 29, 2026

Clinical fidelity of large language models in chronic myeloid leukemia: A multimodel comparative study.

Key Points

Developed fifty standardized clinical vignettes on CML, focusing on key treatment topics.
Evaluated five LLMs under baseline and guideline-augmented prompting conditions.
Responses were scored by a blinded expert panel using a peer-reviewed answer key.
Guideline-augmented prompting increased mean accuracy across the five models from 79.8% to 97.6% (mean difference 17.8 percentage points; 95% CI, 3.9-31.7; P=0.024).
Only Gemini-2.5-Pro achieved perfect accuracy (100%) with guideline-augmented prompting; individual model improvements ranged from 5 to 33 percentage points.
Staging errors occurred in 32% of baseline cases because models relied on outdated WHO classification criteria instead of current NCCN standards.

Abstract

6585

Background: Large language models (LLMs) are emerging as promising tools for clinical decision-making and are increasingly investigated for their potential in the diagnosis, prognosis, and personalized treatment of chronic myeloid leukemia (CML). This study evaluates the efficacy of five leading LLMs and examines whether incorporating up-to-date clinical guidelines into LLM prompts enhances clinical decision-making.

Methods: We developed fifty standardized clinical vignettes on CML covering staging, tyrosine kinase inhibitor (TKI) selection, monitoring, adverse events, and treatment-free remission. Five LLMs (DeepSeek-R1, Gemini-2.5-Pro, Claude 4.0, GPT-OSS-120B, and LLAMA 4) generated responses under two prompt conditions. In the baseline condition, the clinical vignette alone was entered and a response was obtained. In the guideline-augmented condition, models were instructed to extract information from the NCCN v1.2026 and ESMO 2017 guidelines. A blinded expert panel evaluated the responses against a peer-reviewed answer key, assigning 0 for an incorrect, 1 for a partial, and 2 for a correct answer. With fifty vignettes each scored out of 2, the maximum possible score was 100.

Results: Guideline-augmented prompting increased accuracy for every model, raising mean performance from 79.8% to 97.6% (mean difference: 17.8 percentage points, 95% CI: 3.9-31.7, P=0.024). Improvements for individual models ranged from 5 to 33 percentage points. Only Gemini-2.5-Pro achieved perfect accuracy (100%). Knowledge drift was observed in four of the five models (80%), primarily due to reliance on outdated WHO classification criteria instead of current NCCN standards. This discrepancy resulted in staging errors in 32% of baseline cases. The diagnostic workup category had the highest baseline accuracy (94%), whereas the treatment-free remission criteria showed the largest improvement following guideline integration, increasing from 72% to 96%.

Conclusions: LLMs possess substantial medical knowledge that can facilitate clinical decision-making. However, these AI systems remain vulnerable to errors such as knowledge drift and hallucinations. This scalable framework demonstrates that mandatory compliance with guidelines, supported by structured prompting, is essential for safe LLM integration into oncology workflows. Institutions implementing LLM-based clinical decision support should require guideline integration rather than relying solely on pre-trained knowledge. Future studies should validate these findings in prospective clinical settings and evaluate mechanisms for automated guideline updating.

Comparative accuracy of LLM responses under baseline vs. guideline-augmented prompts.

| LLM | Baseline response accuracy (%) | Guideline-augmented response accuracy (%) | Delta (percentage points) |
| --- | --- | --- | --- |
| Claude 4.0 | 82 | 97 | +15 |
| DeepSeek-R1 | 72 | 97 | +25 |
| Gemini-2.5-Pro | 89 | 100 | +11 |
| GPT-OSS-120B | 90 | 95 | +5 |
| LLAMA 4 | 66 | 99 | +33 |
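
The abstract does not name the statistical test behind the reported mean difference and confidence interval. As a rough check, the sketch below assumes a paired t-interval over the five per-model differences (an assumption, not the authors' stated method) and recomputes the summary from the per-model accuracies reported above. The function name `paired_summary` and the hardcoded critical value are illustrative choices.

```python
import math
import statistics

# Per-model accuracy (%) copied from the comparison table in the abstract.
baseline = {
    "Claude 4.0": 82,
    "DeepSeek-R1": 72,
    "Gemini-2.5-Pro": 89,
    "GPT-OSS-120B": 90,
    "LLAMA 4": 66,
}
augmented = {
    "Claude 4.0": 97,
    "DeepSeek-R1": 97,
    "Gemini-2.5-Pro": 100,
    "GPT-OSS-120B": 95,
    "LLAMA 4": 99,
}

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
# Hardcoded so the sketch needs only the standard library.
T_CRIT_DF4 = 2.776


def paired_summary(before, after, t_crit):
    """Return (mean difference, CI low, CI high, t statistic) for paired data.

    Assumes a paired t-interval; this is a guess at the analysis, since the
    abstract does not specify the test that was used.
    """
    diffs = [after[model] - before[model] for model in before]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample SD with n - 1 denominator).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    margin = t_crit * std_err
    return mean_diff, mean_diff - margin, mean_diff + margin, mean_diff / std_err


mean_diff, ci_low, ci_high, t_stat = paired_summary(baseline, augmented, T_CRIT_DF4)
print(f"mean baseline accuracy:  {statistics.mean(baseline.values()):.1f}%")
print(f"mean augmented accuracy: {statistics.mean(augmented.values()):.1f}%")
print(f"mean difference: {mean_diff:.1f} points, 95% CI {ci_low:.1f} to {ci_high:.1f}")
print(f"t statistic (df=4): {t_stat:.2f}")
```

Under these assumptions the sketch gives means of 79.8% and 97.6%, a mean difference of 17.8 points, and an interval of roughly 3.9 to 31.7, which agrees with the figures in the abstract. Note that the interval is computed over only five models, so it describes variation between models rather than between vignettes.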
