Abstract

Background: Large Language Models (LLMs) have shown promising accuracy in generating guideline-based cancer treatment recommendations. However, their alignment with real-world breast cancer management decisions remains poorly characterized. This study presents a novel benchmark that evaluates LLM-generated recommendations against both clinical guidelines and the actual treatments recorded in electronic health records (EHR), while also assessing the models’ capacity for patient-centered communication.

Methods: We extracted structured and unstructured EHR data, including imaging data, for 500 patients with newly diagnosed non-metastatic breast cancer (2023-2025) at Penn Medicine. Each case was processed through a multi-turn, agent-based LLM prompting framework using a range of models, including general-purpose LLMs (GPT-4o, Claude 3.7, DeepSeek R1, Grok) and medically tuned models (MedGemma, HealthBench). Each model received basic case data and then dynamically queried for additional clinical information before recommending a treatment plan. Recommendations were compared to (1) NCCN/ESMO guidelines and (2) the actual treatment received (ATR). We analyzed concordance rates, over-/under-treatment frequencies, and the “Reality Gap” (cases where recommendations matched guidelines but diverged from real-world practice). Propensity score-based causal inference was used to identify demographic and socioeconomic drivers of discordance. Additionally, a multidisciplinary panel rated LLM responses to common patient questions for clarity, empathy, and shared decision-making support.

Results: Across all models, guideline concordance was high (GPT-4o: 90.2%, Claude 3.7: 89.4%, DeepSeek R1: 88.6%, MedGemma: 87.9%). However, concordance with real-world treatment was notably lower (GPT-4o: 67.8%, Claude 3.7: 65.1%, MedGemma: 62.3%), with LLMs generally recommending more intensive treatment than was actually administered.
In 22.3% of cases, the LLM recommendation aligned with guidelines but diverged from the real-world decision, reflecting a substantial “Reality Gap.” Multivariable analysis identified Medicaid insurance (OR 2.14, 95% CI 1.31-3.49), age ≥70 (OR 1.72, 95% CI 1.08-2.75), and documented treatment refusal as independent predictors of discordance. In the communication task (N=100 dialogues), GPT-4o showed relatively strong performance in accuracy (4.61/5) and decision support (4.55/5), though differences between models were modest.

Conclusions: This study directly addresses a central question in AI-enabled oncology: how far are LLMs from actual clinical decision-making? While LLMs can generate medically sound, guideline-concordant recommendations, they often overlook real-world factors such as patient preferences, socioeconomic barriers, and health literacy, all of which play a critical role in shaping care delivery. By jointly evaluating clinical validity and contextual fidelity, our framework quantifies the “Reality Gap” between LLM recommendations and actual treatment. Importantly, we do not advocate that LLMs conform to all real-world deviations, nor should models be encouraged to offer suboptimal plans based on non-clinical attributes. Instead, our findings highlight the importance of using social context to enhance interpretation and patient-centered communication without compromising evidence-based care. Our work provides a concrete model for evaluating and improving AI alignment with real-world clinical decisions in breast cancer care. Further analyses are underway to expand the scope and robustness of these findings.

Citation Format: Z. Qu, X. Wang, S. Pei, Y. Fang. Bridging the Reality Gap: A Multimodal Evaluation of Large Language Models for Real-World Breast Cancer Decision Support [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-04-21.
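
As an illustration of the three headline metrics defined in the Methods (guideline concordance, concordance with the actual treatment received, and the Reality Gap), the following is a minimal sketch rather than the authors' implementation. The `Case` class, its field names, and the `summarize` function are hypothetical, and the four toy cases are invented for illustration; they are not study data. The sketch assumes each case has already been labeled with two yes/no judgments, which the abstract does not describe in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One evaluated case. Field names are hypothetical, not from the study."""
    guideline_concordant: bool  # recommendation agrees with NCCN/ESMO guidance
    matches_atr: bool           # recommendation agrees with the actual treatment received


def summarize(cases):
    """Return the three rates as fractions of all evaluated cases."""
    n = len(cases)
    if n == 0:
        raise ValueError("no cases to summarize")
    guideline = sum(c.guideline_concordant for c in cases)
    atr = sum(c.matches_atr for c in cases)
    # Reality Gap: guideline-concordant recommendation that diverges from the ATR.
    gap = sum(c.guideline_concordant and not c.matches_atr for c in cases)
    return {
        "guideline_concordance": guideline / n,
        "atr_concordance": atr / n,
        "reality_gap": gap / n,
    }


# Toy data, invented for illustration only (not study data).
toy_cases = [
    Case(guideline_concordant=True, matches_atr=True),
    Case(guideline_concordant=True, matches_atr=False),
    Case(guideline_concordant=True, matches_atr=True),
    Case(guideline_concordant=False, matches_atr=False),
]
summary = summarize(toy_cases)
print(summary)
```

Under this definition the Reality Gap can never exceed the guideline concordance rate, since it counts only the guideline-concordant cases that diverge from the actual treatment received.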