Objective: This study aimed to evaluate the ability of GPT-4o and Gemini 2.5 Pro to extract and assign PI-RADS v2.1 scores from free-text prostate MRI reports, and to compare their performance with that of human readers of varying experience.

Methods: Three radiologists with differing levels of experience (resident, fellow, expert) independently reviewed the reports and assigned PI-RADS v2.1 scores. The same reports were submitted to GPT-4o and Gemini 2.5 Pro using prompts. Inter-rater agreement was evaluated using Gwet’s AC1 coefficient, and diagnostic performance was assessed using sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).

Results: Among the human readers, inter-rater agreement was highest between the expert and the fellow (Gwet’s AC1 = 0.68, 95% CI 0.61-0.75), which was significantly higher than the agreement between the two LLMs (Gwet’s AC1 = 0.52, 95% CI 0.44-0.59, P = 0.004). Agreement between the expert and GPT-4o (Gwet’s AC1 = 0.42, 95% CI 0.34-0.51) was numerically lower than that between the expert and Gemini 2.5 Pro (Gwet’s AC1 = 0.49, 95% CI 0.41-0.57, P = 0.17). The AUCs for the resident, fellow, and expert readers were 0.81 (95% CI 0.76-0.87), 0.86 (95% CI 0.81-0.91), and 0.89 (95% CI 0.85-0.93), respectively, and those for GPT-4o and Gemini 2.5 Pro were 0.85 (95% CI 0.81-0.90) and 0.84 (95% CI 0.80-0.89), respectively.

Conclusion: LLMs demonstrated promising performance in assigning PI-RADS scores from free-text prostate MRI reports, with accuracy and agreement approaching those of general radiologists; however, they are not yet ready to replace expert interpretation in high-stakes clinical settings. Nevertheless, these findings support their potential as supplementary tools for report standardization and trainee education.
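For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of the standard unweighted Gwet's AC1 for two raters. It is not the authors' analysis code. The function name `gwet_ac1` and the example scores are hypothetical, and the sketch omits the confidence intervals and significance tests reported in the abstract.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Unweighted Gwet's AC1 for two raters scoring the same items.

    `categories` lists every possible score (e.g. PI-RADS 1 to 5),
    including scores that neither rater happened to use.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")

    n = len(ratings_a)
    q = len(categories)

    # Observed agreement: fraction of items given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on the average prevalence of each category
    # across the two raters, scaled by 1 / (Q - 1).
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_chance = 0.0
    for c in categories:
        prevalence = (counts_a[c] + counts_b[c]) / (2 * n)
        p_chance += prevalence * (1 - prevalence)
    p_chance /= q - 1

    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Illustrative scores only; these are NOT data from the study.
    expert = [1, 2, 3, 4, 5, 4, 3, 2]
    model = [1, 2, 3, 4, 5, 3, 3, 2]
    print(f"AC1 = {gwet_ac1(expert, model, categories=[1, 2, 3, 4, 5]):.3f}")
```

Unlike Cohen's kappa, AC1 estimates chance agreement from category prevalence averaged over raters, which keeps it more stable when most reports fall into a small number of categories.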
Wen et al. studied this question.