We participated in the NTCIR-18 RadNLP2024 shared task 1 and investigated the automation of TNM classification using large language models (LLMs), specifically GPT-4o-mini, GPT-4o, and o1-mini. Our approach combines retrieval based on the cosine similarity of embedding vectors with few-shot learning to improve classification accuracy. In our experiments, o1-mini achieved the highest classification accuracy among the models tested. However, accuracy on the test data was approximately 30% lower than on the validation data. In particular, the low classification accuracy for the T factor highlighted challenges in interpreting tumor size and the extent of infiltration. In this paper, we describe our approach to the task, analyze these results, and report the official results.
Mori et al. studied this question.
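The combination of cosine similarity-based retrieval and few-shot prompting described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names (`cosine_similarity`, `retrieve_top_k`, `build_few_shot_prompt`), the prompt wording, the three-dimensional toy vectors, and the placeholder report texts and labels are all assumptions made for illustration. In practice the vectors would come from an embedding model, and the assembled prompt would be sent to an LLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, examples, k=2):
    """Return the k labeled examples whose embeddings are closest to the query."""
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, ex["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(query_text, shots):
    """Assemble a few-shot prompt from retrieved examples (hypothetical wording)."""
    parts = ["Classify the TNM stage of the final report."]
    for shot in shots:
        parts.append(f"Report: {shot['text']}\nAnswer: {shot['label']}")
    parts.append(f"Report: {query_text}\nAnswer:")
    return "\n\n".join(parts)


# Toy data: placeholder texts, labels, and hand-written vectors (all hypothetical).
EXAMPLES = [
    {"text": "report A", "label": "label A", "embedding": [1.0, 0.0, 0.0]},
    {"text": "report B", "label": "label B", "embedding": [0.0, 1.0, 0.0]},
    {"text": "report C", "label": "label C", "embedding": [0.9, 0.1, 0.0]},
]

if __name__ == "__main__":
    query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedding of a new report
    shots = retrieve_top_k(query_vec, EXAMPLES, k=2)
    print(build_few_shot_prompt("new report", shots))
```

Selecting the examples per query, rather than using a fixed set, means the demonstrations shown to the model are the labeled reports most similar to the one being classified.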