What question did this study set out to answer?

This study aims to evaluate large language models for annotating cancer progression in radiology reports and compare their performance with human annotators.

March 13, 2026 | Open Access

Prompting large language models and evaluating inter- and intra-rater agreement for cancer progression assessment from radiology reports

Key Points

The study analyzed 376 Danish radiology reports from 184 patients with metastatic breast cancer.
Six human annotators, including two experts, classified each report as progressive disease (PD) or non-PD.
A 'reverse questioning' strategy was used to evaluate five LLM model series (Mistral, Gemma, Gemma 2, Llama 3, and Llama 3.1).
Bootstrapping was used to estimate confidence intervals for the non-inferiority comparison, with a non-inferiority margin of 0.1.
Human intra-rater agreement was also measured to assess annotator consistency.
The LLM framework achieved a mean Cohen's kappa of 0.82 for human-to-LLM agreement, non-inferior to human-to-human agreement (0.79).
The best-performing ensemble, Llama 3.1:70B, reached 100% sensitivity, 90% specificity, and an F1 score of 84% on the test set.
Mean human intra-rater agreement was 0.87.
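
As a rough illustration of how the reported sensitivity, specificity, and F1 score relate to one another, the sketch below computes all three from a binary confusion matrix, treating PD as the positive class. The function name `binary_metrics` and the counts are hypothetical: the counts were chosen only because they reproduce the reported percentages, and they are not the study's test-set data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, precision, and F1 for a binary task.

    Assumes PD is the positive class and that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)           # share of true PD reports that were flagged
    specificity = tn / (tn + fp)           # share of non-PD reports correctly cleared
    precision = tp / (tp + fp)             # share of flagged reports that were truly PD
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and sensitivity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts (NOT from the paper): 21 PD reports, 80 non-PD reports.
example = binary_metrics(tp=21, fp=8, fn=0, tn=72)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, no PD report is missed, yet the eight false positives hold precision to about 72%, which is why F1 can sit well below both sensitivity and specificity when PD is the less common class.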

Abstract

Background: Manual annotation of free-text radiology reports is time-consuming and costly, delaying real-world evidence (RWE) studies in oncology. This study aimed to evaluate the performance of large language models (LLMs) in annotating cancer progression from Danish free-text radiology reports. The objectives were to determine whether human-to-LLM inter-rater agreement was non-inferior to human-to-human agreement, establish human intra-rater agreement, and develop a framework for tuning LLM performance to RWE needs.

Materials and methods: We identified 376 radiology reports from 184 patients with metastatic breast cancer from Danish electronic health records. Six human annotators, including two experts, classified radiology reports as progressive disease (PD) or non-PD. A 'reverse questioning' strategy was used to evaluate five LLM model series (Mistral, Gemma, Gemma 2, Llama 3, and Llama 3.1). Bootstrapping estimated confidence intervals (CIs) and assessed non-inferiority of the best-performing LLM ensemble compared with human agreement, using a non-inferiority margin of 0.1.

Results: The LLM framework was non-inferior to human annotators with a mean Cohen's kappa of 0.82 (95% CI 0.74-0.89) for human-to-LLM versus 0.79 (95% CI 0.71-0.86) for human-to-human agreement (P < 0.001). The best-performing ensemble model, Llama 3.1:70B, achieved 100% sensitivity, a specificity of 90%, and an F1 score of 84% on the test set. The mean human intra-rater variability was 0.87.

Conclusions: The proposed LLM framework was non-inferior to human annotators in classifying cancer progression from free-text radiology reports. This offers significant potential for using LLMs as a tool for identifying tumor progression events in clinical assessment and research.
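
The bootstrap non-inferiority comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it resamples individual reports with replacement (ignoring that several reports can belong to one patient), averages Cohen's kappa over all human-human pairs and all human-LLM pairs, and declares non-inferiority when the lower bound of the 95% percentile interval for the difference lies above the negative margin. All function names and the toy labels are hypothetical.

```python
import random
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_kappas(humans, llm, idx):
    """Mean human-LLM and mean human-human kappa on the resampled items idx."""
    hs = [[h[i] for i in idx] for h in humans]
    m = [llm[i] for i in idx]
    human_llm = [cohen_kappa(h, m) for h in hs]
    human_human = [cohen_kappa(x, y) for x, y in combinations(hs, 2)]
    return sum(human_llm) / len(human_llm), sum(human_human) / len(human_human)


def bootstrap_noninferiority(humans, llm, margin=0.1, n_boot=2000, seed=0):
    """Percentile bootstrap for (human-LLM kappa) minus (human-human kappa).

    Returns (lower, upper, non_inferior), where non_inferior is True when the
    lower 95% bound exceeds -margin. Assumes at least two human raters.
    """
    rng = random.Random(seed)
    n = len(llm)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample reports with replacement
        hl, hh = mean_kappas(humans, llm, idx)
        diffs.append(hl - hh)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper, lower > -margin


# Toy labels (hypothetical, not study data): three human raters and one LLM.
base = ["PD", "non-PD", "non-PD", "PD", "non-PD"] * 8
rater_2 = base[:-1] + ["PD"]       # one disagreement with rater 1
rater_3 = ["PD"] + base[1:-1] + ["PD"]
llm_labels = list(base)
lower, upper, ok = bootstrap_noninferiority([base, rater_2, rater_3], llm_labels, n_boot=500)
print(f"difference 95% CI: [{lower:.3f}, {upper:.3f}], non-inferior: {ok}")
```

A patient-level (cluster) bootstrap would be a natural variation given that the 376 reports come from 184 patients; the abstract does not say which resampling unit the authors used.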
