Background: Manual annotation of free-text radiology reports is time-consuming and costly, delaying real-world evidence (RWE) studies in oncology. This study aimed to evaluate the performance of large language models (LLMs) in annotating cancer progression from Danish free-text radiology reports. The objectives were to determine whether human-to-LLM inter-rater agreement was non-inferior to human-to-human agreement, establish human intra-rater agreement, and develop a framework for tuning LLM performance to RWE needs.

Materials and methods: We identified 376 radiology reports from 184 patients with metastatic breast cancer from Danish electronic health records. Six human annotators, including two experts, classified radiology reports as progressive disease (PD) or non-PD. A 'reverse questioning' strategy was used to evaluate five LLM model series (Mistral, Gemma, Gemma 2, Llama 3, and Llama 3.1). Bootstrapping estimated confidence intervals (CIs) and assessed non-inferiority of the best-performing LLM ensemble compared with human agreement, using a non-inferiority margin of 0.1.

Results: The LLM framework was non-inferior to human annotators with a mean Cohen's kappa of 0.82 (95% CI 0.74-0.89) for human-to-LLM versus 0.79 (95% CI 0.71-0.86) for human-to-human agreement (P < 0.001). The best-performing ensemble model, Llama 3.1:70B, achieved 100% sensitivity, a specificity of 90%, and an F1 score of 84% on the test set. The mean human intra-rater variability was 0.87.

Conclusions: The proposed LLM framework was non-inferior to human annotators in classifying cancer progression from free-text radiology reports. This offers significant potential for using LLMs as a tool for identifying tumor progression events in clinical assessment and research.
Kristjánsson et al. studied this question.
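
The abstract describes a bootstrap comparison of Cohen's kappa against a non-inferiority margin of 0.1 but does not spell out the procedure. The following is a minimal sketch of one way such a test could be set up, not the authors' implementation. It assumes a simplified design with two human annotators and one LLM (the study used six annotators and mean agreement), resamples reports with replacement, and declares non-inferiority when the lower bound of the 95% percentile interval for the kappa difference lies above the negative margin. The function names, the percentile interval, and the toy labels are all illustrative assumptions.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def bootstrap_noninferiority(human_a, human_b, llm, margin=0.1,
                             n_boot=2000, seed=0):
    """Bootstrap kappa(human_a, llm) - kappa(human_a, human_b) over reports.

    Returns (lower, upper, non_inferior), where lower and upper bound a 95%
    percentile interval and non_inferior is True when lower > -margin.
    """
    rng = random.Random(seed)
    n = len(human_a)
    diffs = []
    for _ in range(n_boot):
        # Resample report indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        a = [human_a[i] for i in idx]
        b = [human_b[i] for i in idx]
        m = [llm[i] for i in idx]
        diffs.append(cohen_kappa(a, m) - cohen_kappa(a, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper, lower > -margin


if __name__ == "__main__":
    # Toy labels for illustration only (1 = PD, 0 = non-PD); not study data.
    human_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    human_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
    llm = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
    print(bootstrap_noninferiority(human_a, human_b, llm))
```

Resampling at the report level keeps each report's three labels together, so the two kappa values in every replicate are computed on the same reports. A fuller analysis along the lines of the abstract would average pairwise kappa over all annotators and might resample at the patient level, since the 376 reports come from 184 patients.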