October 19, 2025 | Open Access

Automated Esophageal Cancer Staging From Free-Text Radiology Reports: Large Language Model Evaluation Study

Key Points

Automated esophageal cancer staging with the best-performing configuration (INF-72B+IR) achieved an overall accuracy of 61.5%, surpassing clinicians' accuracy of 39.5%.
INF-72B+IR demonstrated an F1-score of 0.60, compared to 0.39 for clinicians (P<.001).
The study utilized a dataset of 1134 free-text radiology reports from 200 patients undergoing surgery for esophageal cancer.
The findings suggest that LLMs can effectively aid in esophageal cancer staging, offering transparent reasoning processes for clinical decisions.

Abstract

Background: Accurate staging of esophageal cancer is crucial for determining prognosis and guiding treatment strategies, but manual interpretation of radiology reports by clinicians is prone to variability, which limits staging accuracy. Recent advances in large language models (LLMs) have shown promise in medical applications, but their utility in esophageal cancer staging remains underexplored.

Objective: This study aims to compare the performance of 3 locally deployed LLMs (INF-72B, Qwen2.5-72B, and LLaMA3.1-70B) and clinicians in preoperative esophageal cancer staging using free-text radiology reports.

Methods: This retrospective study included 200 patients from Shanghai Chest Hospital who underwent esophageal cancer surgery from May to December 2024. The dataset consisted of 1134 Chinese free-text radiology reports. The reference standard was derived from postoperative pathological staging. The 3 LLMs determined tumor classification (T1-T4), node classification (N0-N3), and overall staging (I-IV) using 3 prompting strategies (zero-shot, chain-of-thought, and a proposed interpretable reasoning [IR] method). The McNemar test and Pearson chi-square test were used for comparisons.

Results: INF-72B+IR achieved a superior overall staging accuracy of 61.5% and an F1-score of 0.60, substantially higher than the clinicians’ accuracy of 39.5% and F1-score of 0.39 (all P 0.5).

Conclusions: This study demonstrates that LLMs, particularly when guided by the proposed IR strategy, can accurately and reliably perform esophageal cancer staging from free-text radiology reports. This approach not only provides high-performance predictions but also offers a transparent and verifiable reasoning process, highlighting its potential as a valuable decision-support tool to augment human expertise in complex clinical diagnostic tasks.
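To illustrate the kind of paired comparison named in the Methods, the following is a minimal sketch of staging accuracy and an exact (binomial) McNemar test on paired correctness. It is not the study's code: the abstract does not say which McNemar variant was used, so the exact form is an assumption, and the function names and toy labels are hypothetical and are not study data.

```python
from math import comb


def accuracy(predicted, reference):
    """Fraction of cases where the predicted stage equals the reference stage."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def mcnemar_exact(pred_a, pred_b, reference):
    """Two-sided exact McNemar test for two raters scored on the same cases.

    Only discordant pairs carry information:
      b = cases where A is correct and B is wrong
      c = cases where A is wrong and B is correct
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    triples = list(zip(pred_a, pred_b, reference))
    b = sum(a == r and x != r for a, x, r in triples)
    c = sum(a != r and x == r for a, x, r in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy labels (assumption, for illustration only).
reference = ["II", "III", "I", "III", "II", "IV", "III", "I"]   # pathological stage
model = ["II", "III", "I", "III", "III", "IV", "III", "II"]     # LLM prediction
clinician = ["III", "III", "II", "II", "III", "IV", "II", "II"]  # clinician reading

print("model accuracy:", accuracy(model, reference))
print("clinician accuracy:", accuracy(clinician, reference))
print("McNemar (b, c, p):", mcnemar_exact(model, clinician, reference))
```

The test is paired because both raters stage the same patients, so cases where both are right or both are wrong do not affect the result.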
