Abstract

Background: The use of artificial intelligence and natural language processing technologies in healthcare services has gained significant momentum in recent years. Radiology, with its extensive textual content production and the need for report standardization, presents an ideal field of application for these technologies. This study aimed to evaluate the ability of four prominent large language models (LLMs) to accurately determine RADS scores from free-text radiology reports across five imaging modalities.

Methods: This retrospective cross-sectional study included 250 anonymized radiology reports obtained from a single institution between March 2024 and March 2025. Reports were drawn from thyroid ultrasound (TI-RADS), breast ultrasound and MRI (BI-RADS), prostate MRI (PI-RADS), and coronary computed tomography angiography (CAD-RADS), with 50 reports per modality. Each report was translated into English and reviewed by two radiologists to establish reference scores. The performances of ChatGPT-4o, Gemini 2.0, Claude 3.7, and Perplexity were evaluated in terms of accuracy, agreement (Cohen’s kappa), and critical misclassification rates.

Results: ChatGPT-4o achieved the highest overall accuracy (77.6%) and demonstrated good agreement with radiologists (κ = 0.72), followed by Claude 3.7 (64.4%, κ = 0.56), Gemini 2.0 (62.8%, κ = 0.53), and Perplexity (58.8%, κ = 0.48). Modality-specific analyses revealed the highest accuracy in CAD-RADS and BI-RADS (MRI), while the lowest performance was observed in TI-RADS. The critical misclassification rates were 6.8% for ChatGPT-4o, 10.8% for Claude 3.7, 12.0% for Gemini 2.0, and 14.0% for Perplexity.

Conclusion: LLMs show promising potential in supporting standardized radiology reporting, with ChatGPT-4o outperforming its counterparts across most metrics. However, limitations such as variability across modalities and non-negligible error rates highlight the need for continued refinement before clinical integration.
Özenbaş et al. studied this question.
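
The three evaluation metrics named in the abstract (accuracy, Cohen's kappa, and critical misclassification rate) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the study's method: the abstract does not say whether kappa was weighted, so the unweighted form is shown; it does not define a "critical" misclassification, so the `is_critical` predicate (a difference of two or more categories) is purely hypothetical; and the `reference` and `predicted` lists are toy values, not study data.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of reports where the model score equals the reference score."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance,
    computed from the marginal frequency of each category.
    """
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def critical_rate(reference, predicted, is_critical):
    """Fraction of reports whose (reference, predicted) pair is flagged critical."""
    flagged = sum(is_critical(r, p) for r, p in zip(reference, predicted))
    return flagged / len(reference)


# Toy RADS categories (hypothetical, for illustration only).
reference = [1, 2, 3, 4, 5, 3, 4, 2, 5, 1]
predicted = [1, 2, 3, 4, 5, 3, 5, 2, 4, 3]


def is_critical(ref, pred):
    # Hypothetical rule: an error of two or more categories counts as critical.
    return abs(ref - pred) >= 2


print("accuracy:", accuracy(reference, predicted))
print("kappa:", cohen_kappa(reference, predicted))
print("critical rate:", critical_rate(reference, predicted, is_critical))
```

Because RADS categories are ordinal, a weighted kappa that penalizes large disagreements more than near misses may be a reasonable alternative; the unweighted version is shown only because it is the simplest form.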