April 1, 2026 | Open Access

UTY at the NTCIR-18 RadNLP 2024 Task: Possibilities and Limitations of a Hybrid Rule-Based and LLM Approach for Lung Cancer TNM Classification

Key Points

The research aims to automate the extraction of TNM staging information from lung cancer radiology reports using a hybrid system.
Developed a two-stage pipeline combining large language models and rule-based processing.
Extracted structured information from reports using GPT-4o models.
Applied rule-based methods for T classification and LLM for N and M classifications.
Evaluated performance on validation and test datasets.
Achieved a joint accuracy of 0.8148 on the validation dataset.
T classification accuracy dropped sharply, from 0.8704 on the validation dataset to 0.4769 on the test dataset.
N and M classifications maintained high accuracy on the test dataset.

Abstract

Automated extraction of TNM staging information from radiology reports is a challenging task that requires understanding complex clinical language and applying detailed staging criteria. In this paper, we present our approach to the NTCIR-18 RadNLP 2024 shared task on automated lung cancer staging from Japanese radiology reports. We developed a hybrid system that combines large language models (LLMs) with rule-based processing in a two-stage pipeline: first extracting structured information from reports using GPT-4o models, then applying classification rules to determine the appropriate TNM stages. Our approach employed different strategies for each classification component: a rule-based method for the complex T classification and a more flexible LLM-based approach for N and M classifications. Evaluation results showed strong performance on the validation dataset (joint accuracy of 0.8148) but revealed a significant drop in T classification performance on the test dataset (from 0.8704 to 0.4769), while N and M classifications maintained high accuracy levels. This performance disparity highlights the trade-offs between rule-based precision and LLM flexibility in clinical NLP systems. Our findings suggest that balancing these approaches and leveraging larger development datasets could improve the robustness of automated cancer staging systems for real-world clinical applications.
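
The two-stage design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Findings`, `classify_t`, `stage`, and `joint_accuracy` are hypothetical. Stage 1 (the GPT-4o extraction) is replaced by hand-written values. The T rule uses only the TNM 8th edition size cut-offs for lung cancer (3, 5, and 7 cm) and ignores invasion and other criteria. Joint accuracy is assumed here to mean that T, N, and M are all correct for a report.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Findings:
    """Hypothetical structured output of stage 1 (the LLM extraction step)."""
    tumor_size_mm: Optional[float]  # largest tumor diameter, if reported
    n_stage: str                    # N label proposed directly by the LLM
    m_stage: str                    # M label proposed directly by the LLM


def classify_t(findings: Findings) -> str:
    """Stage 2, rule-based: map tumor size to a coarse T category.

    Simplification: size only (TNM 8th edition cut-offs of 3, 5, 7 cm);
    invasion, atelectasis, and separate nodules are not modeled.
    """
    size = findings.tumor_size_mm
    if size is None:
        return "TX"  # size not extracted, so T cannot be assessed here
    if size <= 30:
        return "T1"
    if size <= 50:
        return "T2"
    if size <= 70:
        return "T3"
    return "T4"


def stage(findings: Findings) -> Tuple[str, str, str]:
    """Combine the rule-based T with the LLM-provided N and M labels."""
    return classify_t(findings), findings.n_stage, findings.m_stage


def joint_accuracy(pred: List[Tuple[str, str, str]],
                   gold: List[Tuple[str, str, str]]) -> float:
    """Fraction of reports where T, N, and M all match the gold labels."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / len(gold)


# Hand-written stand-ins for stage 1 output (no LLM call is made here).
extracted = [
    Findings(22.0, "N0", "M0"),
    Findings(64.0, "N2", "M1"),
    Findings(None, "N1", "M0"),  # size missing, so the T rule returns TX
]
gold = [("T1", "N0", "M0"), ("T3", "N2", "M1"), ("T2", "N1", "M0")]

preds = [stage(f) for f in extracted]
print(preds)
print(joint_accuracy(preds, gold))
```

In this toy example the third report has correct N and M labels but no extracted tumor size, so the T rule cannot fire and the report counts as a joint error. This illustrates how a T-specific failure alone can pull joint accuracy down while N and M accuracy stay high.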
