What question did this study set out to answer?

This evaluation aims to assess the safety and performance of large language models (LLMs) in oncology by examining their alignment with clinical guidelines across various disease subtypes.

May 30, 2026

Safety-aligned evaluation of large language models for oncology clinical decision support across disease subtypes.

Key Points

The study evaluated 216 oncology clinical vignettes spanning five disease areas: leukemia, breast cancer, gastrointestinal (GI) cancers, CNS metastases, and gynecologic oncology.
Each vignette was run through three systems: an unconstrained LLM, an NCCN-anchored retrieval-augmented generation (RAG) configuration, and a literature-anchored system.
Two board-certified oncologists independently scored each output using a modified Generative Performance Score (mGPS) and separately rated readability and rationality.
The NCCN-anchored RAG system achieved a higher mean mGPS and lower hallucination penalties than the other two systems.
Safety performance varied by disease subtype, with leukemia outputs showing predominantly low to intermediate disparity.
CNS metastases and gynecologic oncology were the highest-risk domains, with frequent high-disparity classifications driven by combined hallucination and guideline failures.

Abstract

e13701

Background: Large language models (LLMs) are increasingly evaluated for oncology clinical decision support; however, reported performance varies widely, and safety failures such as hallucinations and guideline misalignment remain poorly characterized across disease contexts. We conducted a multi-subtype, clinician-adjudicated evaluation to assess how evidence-source constraints influence safety-aligned performance.

Methods: We curated 216 oncology clinical vignettes using a standardized tumor-board format spanning leukemia, breast cancer, gastrointestinal (GI) cancers, CNS metastases, and gynecologic oncology. Each vignette was evaluated using three systems: an unconstrained LLM (Output 1), an NCCN-anchored retrieval-augmented generation (RAG) configuration (Output 2), and a literature-anchored system (Output 3). Two board-certified oncologists independently scored each output using a modified Generative Performance Score (mGPS; range −1 to +1), incorporating guideline concordance and hallucination penalties. Readability and rationality were rated separately (Likert 1–5) and used for contextual interpretation. Overall disparity severity was conservatively assigned as the maximum severity across hallucination and guideline axes.

Results: Across all vignettes, the NCCN-anchored RAG system achieved higher mean mGPS and lower hallucination penalties compared with unconstrained and literature-anchored systems. Safety performance varied substantially by disease subtype. Leukemia outputs demonstrated predominantly low to intermediate disparity with rare hallucination-driven high-risk events. Breast cancer outputs showed low-intermediate risk, with high-disparity cases driven primarily by biomarker-dependent guideline misalignment. GI cancers exhibited intermediate-to-high disparity, reflecting multidisciplinary complexity and biomarker omission. CNS metastases and gynecologic oncology represented the highest-risk domains, with frequent high-disparity classifications driven by combined hallucination and guideline failures despite fluent presentation. Readability was consistently moderate to high across systems but did not independently mitigate safety risks.

Conclusions: Safety-aligned performance of oncology LLMs is highly disease-dependent and strongly influenced by evidence-source constraints. Guideline-anchored retrieval significantly reduces hallucination-related risk but does not fully mitigate failures in complex, multidisciplinary settings. Multi-axis, disease-specific evaluation frameworks are essential prior to clinical deployment of LLM-based decision support.
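
The conservative aggregation rule stated in the Methods (overall disparity severity is the maximum severity across the hallucination and guideline axes) can be illustrated with a minimal sketch. The three-level ordinal scale, the `Severity` enum, and the `overall_disparity` function are hypothetical names introduced here for illustration only; the abstract does not specify how severity levels were encoded, and this is not the authors' implementation.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical ordinal encoding of the disparity levels named in the abstract."""
    LOW = 1
    INTERMEDIATE = 2
    HIGH = 3


def overall_disparity(hallucination: Severity, guideline: Severity) -> Severity:
    """Return the worse (maximum) of the two per-axis severities.

    This mirrors the conservative rule in the Methods: a single
    high-severity axis makes the whole output high-disparity.
    """
    return max(hallucination, guideline)


# Illustrative call with made-up ratings, not study data:
# an output with no serious hallucination but a serious guideline failure.
example = overall_disparity(Severity.LOW, Severity.HIGH)
print(example.name)
```

Taking the maximum rather than an average means an output cannot offset a serious failure on one axis with a good rating on the other, which matches the "conservatively assigned" wording in the Methods.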
