What does this research mean for the field?

A locally deployable, schema-first Large Language Model pipeline can accurately and reliably extract registry-grade structured data from free-text surgical pathology reports across multiple cancer types while preserving privacy. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to develop a robust schema for extracting cancer data from surgical pathology reports using a localized AI approach.

May 29, 2026 | Open Access

Digital Registrar: A Schema-First Framework for Multi-Cancer Privacy-Preserving Pathology Abstraction via Local LLMs

Key Points

Aimed to develop a robust schema for extracting cancer data from surgical pathology reports using a locally deployed AI approach.
Developed a clinical ontology aligned with CAP guidelines, covering 10 cancer types and 192 per-organ scalar fields.
Benchmarked a DSPy-based extraction pipeline on 893 internal reports against a pathologist-adjudicated gold standard.
External validation used 242 TCGA reports; hardware feasibility was confirmed on a single 48-GB GPU, supporting privacy-preserving on-premises deployment.
Achieved 92.0% macro-mean exact-match accuracy on internal data with the gpt-oss-20b model.
Critical prognostic indicators showed high fidelity: breast ER/PR accuracy was 98.7% and margin positivity accuracy exceeded 93%.
Accuracy on external TCGA data was 77.5%, rising to 88.0% after excluding structurally silent fields absent from older narratives.
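
To make the headline metric concrete, the sketch below shows one plausible way to compute a macro-mean exact-match accuracy and to re-score after excluding fields. It assumes that "macro-mean" is the unweighted mean of per-field accuracies and that "structurally silent" fields are simply dropped by name; the study may define both differently. The function name, field names, and toy records are hypothetical and not taken from the study.

```python
from collections import defaultdict


def macro_exact_match(gold, pred, exclude=frozenset()):
    """Unweighted mean of per-field exact-match accuracies (assumed definition).

    gold, pred: lists of dicts, one per report, mapping field name -> value.
    exclude: field names to leave out of scoring.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, pred):
        for field, expected in g.items():
            if field in exclude:
                continue
            totals[field] += 1
            hits[field] += int(p.get(field) == expected)
    if not totals:
        raise ValueError("no fields left to score")
    per_field = {f: hits[f] / totals[f] for f in totals}
    return sum(per_field.values()) / len(per_field), per_field


# Toy records invented for illustration only (not data from the study).
gold = [
    {"er_status": "positive", "margin_positive": False, "lvi": None},
    {"er_status": "negative", "margin_positive": True, "lvi": None},
]
pred = [
    {"er_status": "positive", "margin_positive": False, "lvi": "absent"},
    {"er_status": "negative", "margin_positive": False, "lvi": "absent"},
]

overall, per_field = macro_exact_match(gold, pred)
filtered, _ = macro_exact_match(gold, pred, exclude={"lvi"})
print(overall, filtered)  # 0.5 0.75
```

In this toy case, the field that the reports never state drags the macro mean down, and excluding it raises the score, which mirrors the direction of the change reported for the TCGA cohort.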

Abstract

Background/Objectives: Free-text surgical pathology reports hinder automated cancer registry entry and secondary analytics. This study introduces a clinically governed schema layer for interoperability and tests whether a locally deployable Large Language Model (LLM) pipeline can deliver robust registry-grade extraction across institutions.

Methods: We developed a College of American Pathologists (CAP)-aligned clinical ontology encompassing 10 cancer types, 192 per-organ scalar fields, key biomarkers, and nested structures for lymph nodes and margins. The ontology was encoded as Declarative Self-improving Python (DSPy) signatures (DSPy v3.2.1) with grammar-constrained decoding, and the resulting model-agnostic pipeline was benchmarked on 893 internal reports against a pathologist-adjudicated gold standard. External validation used 242 reports from The Cancer Genome Atlas (TCGA). Hardware feasibility was confirmed on a single 48-gigabyte (GB) Graphics Processing Unit (GPU), supporting privacy-preserving on-premises deployment.

Results: Using the gpt-oss-20b model, the framework achieved 92.0% macro-mean exact-match accuracy on internal data, with near-perfect run-to-run reliability. Critical prognostic indicators, including breast estrogen receptor/progesterone receptor (ER/PR) status (98.7%) and margin positivity (>93%), maintained high fidelity. On the external TCGA cohort, accuracy was 77.5%, rising to 88.0% after excluding structurally silent fields absent from older narratives. Operationally, the model processed reports in 40–70 s, balancing speed and accuracy.

Conclusions: This schema-first abstraction layer decouples clinical logic from specific Artificial Intelligence (AI) models. By reliably transforming narrative reports into machine-readable structures, it establishes a portable, privacy-preserving foundation for automated cancer surveillance, institutional data reuse, and future multimodal clinical systems.
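
The idea of decoupling clinical logic from the model can be illustrated with a small, library-free sketch. It is not the authors' DSPy implementation: the schema, field names, and `toy_backend` are hypothetical, and validating a backend's output after the fact is a simpler stand-in for the grammar-constrained decoding described above. The point is only that the schema is plain data and any backend that returns a dict can be swapped in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    allowed: tuple  # permitted values; None always means "not stated"


# Hypothetical schema fragment; the study's actual ontology is not reproduced here.
BREAST_SCHEMA = (
    Field("er_status", ("positive", "negative")),
    Field("pr_status", ("positive", "negative")),
    Field("margin_positive", (True, False)),
)


def validate(schema, record):
    """Return a cleaned record, rejecting unknown fields and off-schema values."""
    names = {f.name for f in schema}
    extra = set(record) - names
    if extra:
        raise ValueError(f"unknown fields: {sorted(extra)}")
    clean = {}
    for f in schema:
        value = record.get(f.name)
        if value is not None and value not in f.allowed:
            raise ValueError(f"{f.name}: {value!r} not in {f.allowed}")
        clean[f.name] = value
    return clean


def abstract_report(text, schema, backend):
    """Run any backend (a callable returning a dict) and enforce the schema."""
    return validate(schema, backend(text, schema))


def toy_backend(text, schema):
    """Keyword matcher standing in for a local LLM (assumption for this sketch)."""
    lowered = text.lower()
    return {
        "er_status": "positive" if "er positive" in lowered else None,
        "pr_status": "negative" if "pr negative" in lowered else None,
        "margin_positive": False if "margins negative" in lowered else None,
    }


result = abstract_report(
    "ER positive, PR negative. Margins negative.", BREAST_SCHEMA, toy_backend
)
print(result)
```

Replacing `toy_backend` with a function that calls a locally hosted model would leave the schema and validation code unchanged, which is the property the abstract describes as model-agnostic.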