What question did this study set out to answer?

This study assesses how closely large language model (LLM) recommendations align with real-world breast cancer treatment decisions, and how effectively the models communicate with patients.

February 22, 2026

Abstract PS3-04-21: Bridging the Reality Gap: A Multimodal Evaluation of Large Language Models for Real-World Breast Cancer Decision Support

Key Points

The study assesses how closely large language model (LLM) recommendations align with real-world breast cancer treatment decisions, and how effectively the models communicate with patients.
EHR data were extracted from 500 patients with newly diagnosed non-metastatic breast cancer.
Each case was run through a multi-turn, agent-based LLM prompting framework using both general-purpose and medically tuned models.
LLM recommendations were compared against NCCN/ESMO guidelines and the actual treatment received, using concordance rates and propensity score-based causal inference.
Guideline concordance was high across models (87.9%-90.2%), but concordance with real-world treatment was lower (62.3%-67.8%).
In 22.3% of cases, the recommendation aligned with guidelines but diverged from the actual treatment, which the authors describe as a significant 'Reality Gap'.
Medicaid insurance, age ≥70, and documented treatment refusal were identified as independent predictors of discordance.
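
As an illustration of how the concordance rates and the Reality Gap in the key points above could be tallied, here is a minimal Python sketch. The `CaseResult` schema, its field names, the `concordance_summary` helper, and the toy data are all assumptions made for this example; the abstract does not describe the authors' implementation, and the numbers below are not study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One model recommendation scored for one patient case (hypothetical schema)."""
    matches_guideline: bool  # recommendation agrees with NCCN/ESMO guidance
    matches_actual: bool     # recommendation agrees with the actual treatment received


def concordance_summary(cases: list[CaseResult]) -> dict[str, float]:
    """Return guideline concordance, real-world concordance, and Reality Gap as fractions."""
    if not cases:
        raise ValueError("at least one case is required")
    n = len(cases)
    guideline = sum(c.matches_guideline for c in cases)
    actual = sum(c.matches_actual for c in cases)
    # Reality Gap: guideline-concordant, but different from what the patient received.
    gap = sum(c.matches_guideline and not c.matches_actual for c in cases)
    return {
        "guideline_concordance": guideline / n,
        "actual_concordance": actual / n,
        "reality_gap": gap / n,
    }


# Toy data for illustration only; these are not the study's cases.
toy_cases = [
    CaseResult(matches_guideline=True, matches_actual=True),
    CaseResult(matches_guideline=True, matches_actual=True),
    CaseResult(matches_guideline=True, matches_actual=False),
    CaseResult(matches_guideline=False, matches_actual=False),
]
summary = concordance_summary(toy_cases)
print(summary)
```

Under this definition, the Reality Gap equals the difference between the two concordance rates only when no recommendation matches the actual treatment while deviating from guidelines; otherwise the gap is larger than that difference.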

Abstract

Background: Large Language Models (LLMs) have shown promising accuracy in generating guideline-based cancer treatment recommendations. However, their alignment with real-world breast cancer management decisions remains poorly characterized. This study presents a novel benchmark evaluating LLM-generated recommendations against both clinical guidelines and actual treatments recorded in electronic health records (EHR), while also assessing the models' capacity for patient-centered communication.

Methods: We extracted structured and unstructured EHR data from 500 newly diagnosed non-metastatic breast cancer patients (2023-2025) at Penn Medicine, including imaging data. Each case was processed through a multi-turn, agent-based LLM prompting framework using a range of models, including general-purpose LLMs (GPT-4o, Claude 3.7, DeepSeek R1, Grok), as well as medically tuned models (MedGemma, HealthBench). Each model received basic case data and dynamically queried for additional clinical information before recommending a treatment plan. Recommendations were compared to (1) NCCN/ESMO guidelines and (2) the actual treatment received (ATR). Concordance rates, over-/under-treatment frequencies, and the "Reality Gap" (cases where recommendations matched guidelines but diverged from real-world practice) were analyzed. Propensity score-based causal inference was used to identify demographic and socioeconomic drivers of discordance. Additionally, a multidisciplinary panel rated LLM responses to common patient questions in terms of clarity, empathy, and shared decision-making support.

Results: Across all models, guideline concordance was high (GPT-4o: 90.2%, Claude 3.7: 89.4%, DeepSeek R1: 88.6%, MedGemma: 87.9%). However, concordance with real-world treatment was notably lower (GPT-4o: 67.8%, Claude 3.7: 65.1%, MedGemma: 62.3%), with LLMs generally recommending more intensive treatment than was actually administered. In 22.3% of cases, the LLM recommendation aligned with guidelines but diverged from the real-world decision, reflecting a significant "Reality Gap." Multivariable analysis identified Medicaid insurance (OR 2.14, 95% CI 1.31-3.49), age ≥70 (OR 1.72, 95% CI 1.08-2.75), and documented treatment refusal as independent predictors of discordance. In the communication task (N=100 dialogues), GPT-4o showed relatively strong performance in accuracy (4.61/5) and decision support (4.55/5), though differences between models were modest.

Conclusions: This study directly addresses a central question in AI-enabled oncology: how far are LLMs from actual clinical decision-making? While LLMs are capable of generating medically sound and guideline-concordant recommendations, real-world factors such as patient preferences, socioeconomic barriers, and health literacy are often overlooked, despite playing a critical role in shaping care delivery. By jointly evaluating clinical validity and contextual fidelity, our framework quantifies the "reality gap" between LLM recommendations and actual treatment. Importantly, we do not advocate for LLMs to conform to all real-world deviations, nor should models be encouraged to offer suboptimal plans based on non-clinical attributes. Instead, our findings highlight the importance of using social context to enhance interpretation and patient-centered communication without compromising evidence-based care. Our work provides a concrete model for evaluating and improving AI alignment with real-world clinical decisions in breast cancer care. Further analyses are underway to expand the scope and robustness of these findings.

Citation Format: Z. Qu, X. Wang, S. Pei, Y. Fang. Bridging the Reality Gap: A Multimodal Evaluation of Large Language Models for Real-World Breast Cancer Decision Support [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-04-21.
