May 15, 2026 | Open Access

Evaluation of the performance of large language models in determining RADS scores from radiology reports

Key Points

This study aims to assess the performance of large language models in determining RADS scores from radiology reports.
Retrospective cross-sectional analysis of 250 anonymized radiology reports from March 2024 to March 2025.
Four large language models evaluated: ChatGPT-4o, Gemini 2.0, Claude 3.7, and Perplexity.
Accuracy and agreement evaluated through Cohen’s kappa and misclassification rates.
ChatGPT-4o achieved the highest accuracy at 77.6% with a Cohen’s kappa of 0.72.
Claude 3.7 followed with 64.4% accuracy (κ = 0.56), and Gemini 2.0 had 62.8% accuracy (κ = 0.53).
Critical misclassification rates were 6.8% for ChatGPT-4o and up to 14.0% for Perplexity.

Abstract

Background: The use of artificial intelligence and natural language processing technologies in healthcare services has gained significant momentum in recent years. Radiology, with its extensive textual content production and the need for report standardization, presents an ideal field of application for these technologies. This study aimed to evaluate the ability of four prominent large language models (LLMs) to accurately determine RADS scores from free-text radiology reports across five imaging modalities.

Methods: This retrospective cross-sectional study included 250 anonymized radiology reports obtained from a single institution between March 2024 and March 2025. Reports were drawn from thyroid ultrasound (TI-RADS), breast ultrasound and breast MRI (BI-RADS), prostate MRI (PI-RADS), and coronary computed tomography angiography (CAD-RADS), with 50 reports per modality. Each report was translated into English and reviewed by two radiologists to establish reference scores. The performances of ChatGPT-4o, Gemini 2.0, Claude 3.7, and Perplexity were evaluated in terms of accuracy, agreement (Cohen’s kappa), and critical misclassification rates.

Results: ChatGPT-4o achieved the highest overall accuracy (77.6%) and demonstrated good agreement with radiologists (κ = 0.72), followed by Claude 3.7 (64.4%, κ = 0.56), Gemini 2.0 (62.8%, κ = 0.53), and Perplexity (58.8%, κ = 0.48). Modality-specific analyses revealed the highest accuracy in CAD-RADS and BI-RADS (MRI), while the lowest performance was observed in TI-RADS. The critical misclassification rates were 6.8% for ChatGPT-4o, 10.8% for Claude 3.7, 12.0% for Gemini 2.0, and 14.0% for Perplexity.

Conclusion: LLMs show promising potential in supporting standardized radiology reporting, with ChatGPT-4o outperforming its counterparts across most metrics. However, limitations such as variability across modalities and non-negligible error rates highlight the need for continued refinement before clinical integration.
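For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how accuracy and unweighted Cohen's kappa can be computed from paired reference and model scores. It is not the authors' analysis code: the abstract does not say whether a weighted or unweighted kappa was used, the function names `accuracy` and `cohens_kappa` are illustrative, and the score lists are made-up toy values, not data from the study.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of reports where the model score equals the reference score."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohens_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    p_o = accuracy(reference, predicted)  # observed agreement
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Agreement expected by chance, from each rater's marginal frequencies.
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical category
    return (p_o - p_e) / (1 - p_e)


# Toy example (hypothetical scores, NOT taken from the study).
reference_scores = [1, 2, 3, 4, 5, 3, 4, 2]
model_scores = [1, 2, 3, 4, 4, 3, 5, 2]

print(f"accuracy = {accuracy(reference_scores, model_scores):.2f}")    # 0.75
print(f"kappa    = {cohens_kappa(reference_scores, model_scores):.2f}")  # 0.68
```

Kappa is lower than raw accuracy here because it discounts the agreement two raters would reach by chance given how often each uses every category, which is why the study reports both figures side by side.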
