e13669 Background: Cancer Multi-Disciplinary Team Meetings (MDTMs) are central to UK cancer pathways, irrespective of case complexity. A major bottleneck for the MDT is the time-consuming, laborious manual review of clinical summaries and investigation reports needed to prepare MDTM cases. Large Language Models (LLMs) can extract critical, structured information from this large, complex body of unstructured text. After multiple LLMs were benchmarked in an internal sandbox environment as part of an Oncology Intelligence Platform, the optimal LLM (maximal accuracy, minimal hallucinations and minimal Graphics Processing Unit usage) was deployed for a data extraction task to prepare cases for the Breast Cancer MDTM. This was a prospective, external validation exercise to highlight LLM performance. Methods: A retrospective dataset of structured and unstructured radiology text reports for patients with confirmed breast cancer, dated 2018 to 2024, was obtained from the Barts Health NHS Trust Data Platform. The reports were multisource, including regional scans (Mammogram, Ultrasound and MRI Breast) and systemic scans (Staging/Response Assessment CT Chest Abdomen Pelvis and Bone Scans). An Artificial Intelligence (AI)-powered Cancer MDTM CoPilot software platform (OncoFlow™) was used on this data to perform extraction against a set of defined objective parameters, including clinical TNM (tumour-node-metastasis) classification points. Results: A total of 165 of the aforementioned reports, spanning disease stages I to IV, were prospectively processed by OncoFlow’s fine-tuned LLM, which is specific to the cancer data extraction task. This LLM was an open-source, domain-specific, multilingual, instruction-tuned (having undergone distillation and reinforcement learning), autoregressive transformer model. There were 21 extraction features.
These were divided into 3 tiers based on data type, with Tier 1 split into two sub-tiers: T1a (continuous numeric), T1b (discrete ordinal), T2 (categorical with intrinsic order) and T3 (free text), comprising 2, 4, 3 and 12 parameters respectively. Performance for T1 features was measured using Mean Absolute Error (MAE), with reported scores ranging from 96 to 99% for T1a and 74 to 93% for T1b. T2, being multi-class, was evaluated with F1 scores: Micro-F1 (model performance on the whole dataset across all classes) ranged from 0.8 to 0.9, and Macro-F1 (the average of per-class model performance) ranged from 0.6 to 0.8. Token-level F1 score, which combines precision and recall, was used to measure model performance for T3 parameters and ranged from 78 to 95%. Exact match accuracy for the same parameters was 62 to 93%. Conclusions: The LLM achieved robust, clinically relevant accuracy scores across all data tiers. The reliable scores showcase the model’s readiness to streamline and standardise MDTM case preparations.
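For readers unfamiliar with the metrics named in the Results, the following is a minimal Python sketch of their standard textbook definitions. It is not the authors' evaluation code: the function names, the whitespace tokenisation, the case-insensitive comparison and the sample values are all assumptions made for illustration. The abstract also reports MAE-based scores as percentages; how MAE was converted to a percentage is not stated, so the sketch returns the raw error only.

```python
from collections import Counter


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and extracted numbers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    counts = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts.append((tp, fp, fn))
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = _f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro-F1 computes F1 per class, then takes the unweighted mean.
    macro = sum(_f1(*c) for c in counts) / len(counts)
    return micro, macro


def token_f1(reference, prediction):
    """Token-overlap F1 between two strings (assumed whitespace tokenisation)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return float(ref_tokens == pred_tokens)
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(reference, prediction):
    """True when the strings match after trimming and lower-casing."""
    return reference.strip().lower() == prediction.strip().lower()


if __name__ == "__main__":
    # Hypothetical values, not taken from the study data.
    print(mean_absolute_error([10, 20, 30], [12, 20, 27]))
    print(micro_macro_f1(["N0", "N1", "N1", "N2"], ["N0", "N1", "N2", "N2"]))
    print(token_f1("left breast mass 23 mm", "left breast mass"))
    print(exact_match("Left breast", " left breast "))
```

The gap between Micro-F1 and Macro-F1 in a sketch like this grows when rare classes are predicted poorly, which is one possible reading of the lower Macro-F1 range reported for T2.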
Khanna et al. (Thu,) studied this question.