What question did this study set out to answer?

This research aims to evaluate the effectiveness of prompt engineering for TNM staging classification using large language models.

April 1, 2026 | Open Access

Ubie at the NTCIR-18 RadNLP Main Task: Few-shot Classification of TNM Staging for Japanese Radiology Reports Using LLMs

Key Points

The Ubie team participated in the NTCIR-18 RadNLP core task on lung cancer staging from Japanese radiology reports.
The team compared multiple proprietary models (Gemini 1.5 Pro, Gemini Exp. 1206, and o1) under zero-shot, few-shot, chain-of-thought (CoT), and self-feedbacked instruction prompts.
The analysis examined how each prompt strategy affected classification performance.
Model evolution led to significant performance improvements in this medical text classification task.
Explicitly prompting reasoning steps (CoT) produced the most substantial gains for the non-reasoning Gemini models, while the o1 reasoning model showed limited benefit.
Self-feedbacked instruction showed no improvement for Gemini 1.5 Pro, suggesting that its effectiveness varies by model.

Abstract

The Ubie team participated in the RadNLP core task on lung cancer staging classification based on Japanese radiology reports at NTCIR-18. This paper reports our approach and analyzes the official results. We investigated the impact of prompt engineering on TNM classification using large language models (LLMs). We compared multiple proprietary models available as of January 2025 (Gemini 1.5 Pro, Gemini Exp. 1206, and o1) using various prompt configurations, including zero-shot, few-shot, chain-of-thought (CoT), and self-feedbacked instruction. The results demonstrate significant performance improvements driven by model evolution in this medical text classification task. Analysis of prompt variations revealed differential impacts based on model capabilities. For the Gemini models tested, explicitly prompting reasoning steps (CoT) led to the most substantial performance gains. In contrast, the o1 model, a reasoning model that performs internal CoT and self-evaluation, showed limited benefit from explicit reasoning prompts, suggesting that strategies effective for non-reasoning models are less critical for advanced reasoning models. This pattern, already noted in general guidance on prompting reasoning models, also holds in our medical text classification experiments. The effectiveness of self-feedbacked instruction varied: it showed no improvement for Gemini 1.5 Pro, possibly because of inadequate feedback generation and its dependence on factors such as few-shot example selection. While prompt engineering offered limited gains for the reasoning model evaluated, it provided substantial performance benefits for non-reasoning models, highlighting its value for optimizing models without inherent advanced reasoning capabilities.
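To make the prompt configurations named in the abstract concrete, the following is a minimal sketch of how zero-shot, few-shot, and CoT prompts might be assembled from the same building blocks. Everything in it is an assumption made for illustration: the `Example` type, the `build_prompt` helper, the instruction wording in `TASK` and `COT_HINT`, and the placeholder label format are hypothetical and are not taken from the paper. Self-feedbacked instruction is not covered here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """A labeled report used as a few-shot demonstration (hypothetical schema)."""
    report: str
    label: str


# Hypothetical instruction text; the paper's actual prompts are not reproduced here.
TASK = "Classify the TNM stage described in the following radiology report."
COT_HINT = "Explain your reasoning step by step, then give the final answer."


def build_prompt(
    report: str,
    examples: Sequence[Example] = (),
    use_cot: bool = False,
) -> str:
    """Assemble one prompt string.

    No examples and use_cot=False gives a zero-shot prompt; passing examples
    gives a few-shot prompt; use_cot=True adds an explicit reasoning request.
    """
    parts = [TASK]
    if use_cot:
        parts.append(COT_HINT)
    # Each demonstration pairs a report with its known label.
    for ex in examples:
        parts.append(f"Report: {ex.report}\nAnswer: {ex.label}")
    # The target report is left unanswered for the model to complete.
    parts.append(f"Report: {report}\nAnswer:")
    return "\n\n".join(parts)
```

Under these assumptions, `build_prompt(report)` yields a zero-shot prompt, `build_prompt(report, examples)` a few-shot one, and `use_cot=True` can be combined with either. Keeping the configurations as flags on a single function makes it easy to run the same report through every variant and compare the outputs per model.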
