What does this research mean for the field?

A hybrid method combining rule-based natural language processing and selective large language model use achieves 97-99% sensitivity and over 99% specificity for real-time cancer case ascertainment. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to develop a real-time cancer data mart for timely cancer case ascertainment.

May 30, 2026

Building a real-time cancer data mart (CDM) for an integrated health system, leveraging traditional natural language processing (NLP) methods, curated data sources, and large language models (LLMs).

Key Points

The study aims to develop a real-time cancer data mart for timely cancer case ascertainment.
Identified 230,500 patients with no cancer history who had pathology reports or cancer diagnoses between 07/01/2023 and 12/31/2023.
Utilized eMaRC, a rule-based NLP model, and MedGemma, a generative LLM, for patient data analysis.
Evaluated accuracy using sensitivity, specificity, PPV, and NPV for various cancer types.
High specificity (>99%) was achieved by both NLP-only and hybrid methods.
Sensitivity improved from 95% to 99% for prostate cancer and 94% to 97% for lung cancer with the hybrid method.
With the hybrid method, PPV reached 91% for breast and prostate cancers but only 71% for lung cancer, so lung cancer classifications are less reliable.
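
The four accuracy measures named above follow directly from a 2x2 comparison against the registry. The sketch below is illustrative only: the class and function names are hypothetical, and the 0.999 specificity is an assumed value, since the abstract reports only ">99%". It also shows why PPV can fall well below sensitivity and specificity when true case prevalence is as low as 0.3% to 0.7%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing a method's output against the registry (hypothetical container)."""
    tp: int  # flagged by the method, confirmed in the registry
    fp: int  # flagged by the method, not in the registry
    tn: int  # not flagged, not in the registry
    fn: int  # not flagged, but present in the registry


def accuracy_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return sensitivity, specificity, PPV, and NPV from the four counts."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Assumed inputs for illustration; not figures taken from the study's tables.
    for prevalence in (0.003, 0.007):
        ppv = ppv_from_rates(sensitivity=0.97, specificity=0.999, prevalence=prevalence)
        print(f"prevalence={prevalence:.3f} -> PPV={ppv:.2f}")
```

Under these assumed inputs, the rarer cancer has the lower PPV even though sensitivity and specificity are unchanged, which is consistent with the pattern the abstract reports.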

Abstract

e13654 Background: Timely cancer case ascertainment is critical for supporting clinical trials, research, and operational workflows. Although accredited cancer registries are the gold standard, their information on people diagnosed with cancer is often delayed by well over one year post-diagnosis. Advances in LLMs offer promising solutions but require high computational and GPU power. A novel strategy combining rule-based NLP with curated data sources and selective LLM use may offer a practical alternative. We developed a real-time CDM and evaluated two approaches, an NLP-only method and a hybrid method, for identifying incident cancer cases and classifying key oncology characteristics.

Methods: We identified 230,500 patients with no cancer history who had pathology reports or cancer diagnoses between 07/01/2023 and 12/31/2023 in Kaiser Permanente Northern California (KPNC). All records were processed using eMaRC, a CDC rule-based NLP model widely used by cancer registries. For the hybrid method, pathology reports indicating malignancy or suspicious findings were further analyzed using MedGemma, a generative LLM trained on medical datasets and applied using structured prompts. Both models classified malignancy, primary site, histology, and behavior. When the models disagreed, MedGemma results were prioritized. We evaluated accuracy by comparing identified cancer cases against those recorded in the KPNC Cancer Registry, which conforms to SEER/NAACCR standards. Sensitivity, specificity, PPV, and NPV were calculated for 1631 breast, 851 colorectal, 677 lung, 687 melanoma, and 1358 prostate cancer cases, which together represent 60% of annual cases recorded in the registry.

Results: Both methods (eMaRC only and the hybrid approach) demonstrated high specificity (>99%). Adding MedGemma to eMaRC resulted in sensitivities of 97%–99% across all cancer types, an average 2% increase over eMaRC alone. Sensitivity improved the most for prostate cancer (95% to 99%), followed by lung cancer (94% to 97%). Among the 230,500 patients, true case prevalence ranged from 0.3% to 0.7%. Using the hybrid approach, PPV was 91% for breast and prostate, ranged from 84% to 86% for melanoma and colorectal, and was lowest for lung at 71%. In practice, cases classified as breast and prostate are generally reliable, while lung cancer classification needs further review due to a higher rate of false positives.

Conclusions: A hybrid method combining rule-based NLP, curated data sources, and selective LLM use achieved overall high sensitivity and specificity. Although less detailed than the cancer registry and prone to misclassification of certain cancers, the CDM's rapid and automated process is highly scalable and efficient, making it valuable for operational use and research requiring rapid case identification.
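
The routing logic described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Classification` fields, the `suspicious` flag, and both `toy_*` classifiers are hypothetical stand-ins, and no real eMaRC or MedGemma interface is assumed. The only behavior taken from the abstract is that every record goes through the rule-based model, only reports flagged as malignant or suspicious go to the LLM, and the LLM result is used when the two disagree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Classification:
    """Items both models classify per the abstract; 'suspicious' is an assumed flag."""
    malignant: bool
    suspicious: bool = False
    primary_site: Optional[str] = None
    histology: Optional[str] = None
    behavior: Optional[str] = None


Classifier = Callable[[str], Classification]


def hybrid_classify(report: str, rule_based: Classifier, llm: Classifier) -> Classification:
    """Run the rule-based pass on every report; call the LLM only when flagged."""
    first_pass = rule_based(report)
    if not (first_pass.malignant or first_pass.suspicious):
        # Not flagged: the costly LLM call is skipped entirely.
        return first_pass
    # Flagged: the LLM result is returned, so it wins whenever the models disagree.
    return llm(report)


# Hypothetical stand-ins so the sketch runs; real models would replace these.
def toy_rule_based(report: str) -> Classification:
    text = report.lower()
    if "carcinoma" in text:
        return Classification(malignant=True, primary_site="unknown")
    if "atypical" in text:
        return Classification(malignant=False, suspicious=True)
    return Classification(malignant=False)


llm_calls: list[str] = []  # records which reports reached the LLM stand-in


def toy_llm(report: str) -> Classification:
    llm_calls.append(report)
    text = report.lower()
    site = "prostate" if "prostat" in text else "unknown"
    return Classification(malignant="carcinoma" in text, primary_site=site)


if __name__ == "__main__":
    for text in ("Benign fibrous tissue.", "Prostatic adenocarcinoma, Gleason 7."):
        print(text, "->", hybrid_classify(text, toy_rule_based, toy_llm))
    print("LLM calls:", len(llm_calls))
```

Restricting the LLM to flagged reports is what keeps the compute cost low in this design: reports the rule-based pass does not flag never reach the GPU-bound model.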

Cite This Study

Zhu et al. studied this question.

synapsesocial.com/papers/6a1a812b0307b78509433132
https://doi.org/10.1200/jco.2026.44.16_suppl.e13654