The Ubie team participated in the RadNLP core task at NTCIR-18, which involves lung cancer staging classification from Japanese radiology reports. This paper reports our approach and analyzes the official results. We investigated the impact of prompt engineering on TNM classification using large language models (LLMs). We compared multiple proprietary models available as of January 2025 (Gemini 1.5 Pro, Gemini Exp. 1206, and o1) under several prompt configurations: zero-shot, few-shot, chain-of-thought (CoT), and self-feedbacked instruction. The results show significant performance improvements driven by model evolution in this medical text classification task. The effect of prompt variations differed with model capability. For the Gemini models tested, explicitly prompting reasoning steps (CoT) led to the most substantial performance gains. In contrast, the o1 model, a reasoning model that performs internal CoT and self-evaluation, showed limited benefit from explicit reasoning prompts, suggesting that strategies effective for non-reasoning models are less critical for advanced reasoning models. This observation in our medical text classification experiments is consistent with general guidance on prompting reasoning models. The effectiveness of self-feedbacked instruction varied: it showed no improvement for Gemini 1.5 Pro, possibly because the generated feedback was inadequate and because the method depends on factors such as few-shot example selection. While prompt engineering offered limited gains for the reasoning model evaluated, it provided substantial performance benefits for non-reasoning models, highlighting its value for optimizing models without inherent advanced reasoning capabilities.
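To make the prompt configurations named above concrete, the following is a minimal sketch of how zero-shot, few-shot, and CoT prompts might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the instruction strings `TASK` and `COT_HINT`, the function `build_prompt`, and the "Report/Answer" layout are all hypothetical, and the self-feedbacked instruction setup is omitted because the abstract does not describe its mechanics.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the actual prompts used in the paper
# are not given in the abstract.
TASK = (
    "Classify the TNM stage (T, N, M) of the lung cancer "
    "described in the radiology report."
)
COT_HINT = (
    "First reason step by step about the primary tumor, lymph nodes, "
    "and metastasis, then give the final answer."
)


def build_prompt(
    report: str,
    examples: Sequence[Tuple[str, str]] = (),
    use_cot: bool = False,
) -> str:
    """Assemble a prompt string.

    - no examples, use_cot=False -> zero-shot
    - with examples              -> few-shot
    - use_cot=True               -> adds an explicit reasoning instruction
    """
    parts = [TASK]
    if use_cot:
        parts.append(COT_HINT)
    # Each few-shot example is a (report, label) pair shown before the query.
    for ex_report, ex_label in examples:
        parts.append(f"Report: {ex_report}\nAnswer: {ex_label}")
    # The query report always comes last, with the answer left open.
    parts.append(f"Report: {report}\nAnswer:")
    return "\n\n".join(parts)
```

For a reasoning model such as o1, the abstract's finding suggests the `use_cot=False` variant may be nearly as effective as the explicit CoT variant, whereas for the non-reasoning Gemini models the CoT variant mattered most.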
Nishibayashi et al. (Fri,) studied this question.