Motivation: The potential of large language models (LLMs) to automate complex medical tasks, such as TNM staging from breast cancer DCE-MRI reports, remains unexplored.
Goal(s): To evaluate and compare ChatGPT 4.0, ChatGPT 3.5, and Google Bard for automated TNM staging under zero-shot and few-shot learning.
Approach: We analyzed 745 DCE-MRI reports using the three LLMs and both learning strategies, assessing intra- and inter-LLM agreement, accuracy, and area under the curve (AUC).
Results: ChatGPT 4.0 outperformed the other models (AUC: 0.89 with few-shot learning). Few-shot learning significantly improved every model's performance, with Bard showing the largest gain (a 14.8-percentage-point increase in AUC).
Impact: This study demonstrates the potential of LLMs, especially ChatGPT 4.0, to automate breast cancer TNM staging from DCE-MRI reports. The effectiveness of few-shot learning suggests a pathway for rapid adaptation of AI in radiology, potentially enhancing diagnostic efficiency and accuracy.
Xu et al. studied this question.
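The abstract does not reproduce the prompts used, so the following is only a minimal sketch of how a zero-shot and a few-shot prompt for this task might differ. The instruction wording, the `build_prompt` helper, and the placeholder report and stage strings are all hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the study's actual wording is not given.
INSTRUCTION = (
    "Read the breast DCE-MRI report below and assign the TNM stage. "
    "Answer with the T, N, and M categories only."
)


def build_prompt(report: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a staging prompt.

    With no examples this is a zero-shot prompt; with one or more
    (report, stage) pairs it becomes a few-shot prompt.
    """
    parts = [INSTRUCTION]
    # Each labeled example is shown before the report to be staged.
    for i, (example_report, example_stage) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nReport: {example_report}\nTNM: {example_stage}")
    # The target report comes last, with the answer left open.
    parts.append(f"Report: {report}\nTNM:")
    return "\n\n".join(parts)


# Placeholders stand in for real report text and reference stages.
zero_shot = build_prompt("<report text>")
few_shot = build_prompt(
    "<report text>",
    examples=[
        ("<example report 1>", "<reference stage 1>"),
        ("<example report 2>", "<reference stage 2>"),
    ],
)
```

In this sketch the only difference between the two strategies is whether labeled example reports precede the target report, so the same scoring code could compare both on the same set of reports.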