e13701

Background: Large language models (LLMs) are increasingly evaluated for oncology clinical decision support; however, reported performance varies widely, and safety failures such as hallucinations and guideline misalignment remain poorly characterized across disease contexts. We conducted a multi-subtype, clinician-adjudicated evaluation to assess how evidence-source constraints influence safety-aligned performance.

Methods: We curated 216 oncology clinical vignettes in a standardized tumor-board format spanning leukemia, breast cancer, gastrointestinal (GI) cancers, CNS metastases, and gynecologic oncology. Each vignette was evaluated using three systems: an unconstrained LLM (Output 1), an NCCN-anchored retrieval-augmented generation (RAG) configuration (Output 2), and a literature-anchored system (Output 3). Two board-certified oncologists independently scored each output using a modified Generative Performance Score (mGPS; range −1 to +1) that incorporates guideline concordance and hallucination penalties. Readability and rationality were rated separately (Likert 1–5) and used for contextual interpretation. Overall disparity severity was conservatively assigned as the maximum severity across the hallucination and guideline axes.

Results: Across all vignettes, the NCCN-anchored RAG system achieved a higher mean mGPS and lower hallucination penalties than the unconstrained and literature-anchored systems. Safety performance varied substantially by disease subtype. Leukemia outputs showed predominantly low to intermediate disparity, with rare hallucination-driven high-risk events. Breast cancer outputs showed low to intermediate risk, with high-disparity cases driven primarily by biomarker-dependent guideline misalignment. GI cancers exhibited intermediate to high disparity, reflecting multidisciplinary complexity and biomarker omission. CNS metastases and gynecologic oncology were the highest-risk domains, with frequent high-disparity classifications driven by combined hallucination and guideline failures despite fluent presentation. Readability was consistently moderate to high across systems but did not independently mitigate safety risks.

Conclusions: Safety-aligned performance of oncology LLMs is highly disease-dependent and strongly influenced by evidence-source constraints. Guideline-anchored retrieval significantly reduces hallucination-related risk but does not fully mitigate failures in complex, multidisciplinary settings. Multi-axis, disease-specific evaluation frameworks are essential prior to clinical deployment of LLM-based decision support.
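The conservative severity rule described in Methods (overall disparity taken as the maximum severity across the hallucination and guideline axes) can be illustrated with a minimal sketch. The three-level ordering (low, intermediate, high) follows the categories named in Results; the names `Severity` and `overall_disparity` are hypothetical and are not taken from the study, and the abstract does not specify how each axis was graded or how mGPS itself was computed.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordered disparity levels (assumed encoding; only the ordering matters)."""
    LOW = 1
    INTERMEDIATE = 2
    HIGH = 3


def overall_disparity(hallucination: Severity, guideline: Severity) -> Severity:
    """Return the worse of the two axis severities (conservative max rule)."""
    return max(hallucination, guideline)


# Example: a fluent output with a low hallucination severity but a high
# guideline-misalignment severity is still classified as high disparity.
example = overall_disparity(Severity.LOW, Severity.HIGH)
```

Under this rule a failure on a single axis is enough to place an output in the higher category, which is why fluent outputs with one serious guideline or hallucination failure are still counted as high-disparity.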
Yost et al. (Thu,) studied this question.