Background: With the emergence of artificial intelligence in medical imaging, large language models such as ChatGPT-4o (chat generative pre-trained transformer) have drawn attention for their potential in diagnostic support. However, their performance in nuclear medicine applications remains underexplored. In this study, we aimed to evaluate the capability of a Taiwan Food and Drug Administration (TFDA)-approved bone scintigraphy (BS) platform and of ChatGPT-4o to interpret BS images for the detection and localization of bone metastases.

Methods: A total of 52 BS images were analyzed with three interpretation methods: board-certified nuclear medicine physicians, ChatGPT-4o multimodal image analysis, and the BS platform. Performance was evaluated for both binary classification and lesion localization across nine predefined anatomical regions. The results of ChatGPT-4o and the BS platform were compared with the reports of the board-certified nuclear medicine physicians, which served as the gold standard in this study.

Results: In binary classification, ChatGPT-4o achieved an accuracy of 84.6%, similar to the BS platform's accuracy of 82.7%. However, ChatGPT-4o performed worse in lesion localization: its regional precision was 32.5% and its sensitivity was 13.3%, compared with the BS platform's precision of 80.3% and sensitivity of 64.9%.

Conclusion: ChatGPT-4o showed preliminary potential for detecting bone metastases and assisting in structured report drafting, but its limited lesion-localization performance restricts clinical applicability. The BS platform, developed specifically for bone scintigraphy, demonstrated more consistent regional accuracy in this dataset. These results represent an early proof-of-concept comparison, suggesting feasibility for reporting support rather than clinical deployment. Larger, multi-center studies and domain-specific training will be needed to clarify large language models’ future role in nuclear medicine.
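The accuracy, regional precision, and sensitivity reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that each image is labeled with a set of involved regions, that the nine regions are indexed 0 to 8 (the abstract does not name them), and that regional counts are pooled over all images before the ratios are taken (micro-averaging), which the abstract does not specify. The function names and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def binary_accuracy(truth: List[Set[int]], pred: List[Set[int]]) -> float:
    """Share of images where the metastasis call (any region flagged) matches the reference."""
    correct = sum((len(t) > 0) == (len(p) > 0) for t, p in zip(truth, pred))
    return correct / len(truth)


def regional_precision_sensitivity(
    truth: List[Set[int]], pred: List[Set[int]]
) -> Tuple[float, float]:
    """Pool region-level counts over all images (assumed micro-averaging)."""
    tp = fp = fn = 0
    for t, p in zip(truth, pred):
        tp += len(t & p)  # regions flagged by both reader and reference
        fp += len(p - t)  # regions flagged by the reader only
        fn += len(t - p)  # reference regions the reader missed
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, sensitivity


# Toy example with three images; region indices 0..8 are placeholders, not study data.
reference = [{0, 1}, set(), {3}]
reader = [{0}, {2}, {3, 4}]

accuracy = binary_accuracy(reference, reader)
precision, sensitivity = regional_precision_sensitivity(reference, reader)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} sensitivity={sensitivity:.3f}")
```

As an arithmetic check on the reported figures, 44 correct calls out of 52 images gives 84.6% and 43 out of 52 gives 82.7%, which is consistent with the two binary accuracies. The underlying counts are an inference and are not stated in the abstract.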
Lee et al. (Sun,) studied this question.