What question did this study set out to answer?

The study evaluated how well GPT-4o and Gemini 2.5 Pro extract PI-RADS v2.1 scores from free-text prostate MRI reports, and compared their performance with that of human readers.

April 14, 2026 | Open Access

Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers

Key Points

The aim is to evaluate and compare GPT-4o and Gemini 2.5 Pro's ability to extract PI-RADS v2.1 scores from free-text prostate MRI reports against human readers.
Three radiologists with varying experience levels assessed MRI reports and assigned PI-RADS scores.
The reports were also processed using GPT-4o and Gemini 2.5 Pro for score extraction.
Inter-rater agreement was calculated using Gwet’s AC1 coefficient, and diagnostic performance was evaluated with sensitivity, specificity, and AUC.
Inter-rater agreement was highest between expert and fellow radiologists (Gwet’s AC1 = 0.68).
Agreement between the expert and each LLM was lower (GPT: Gwet’s AC1 = 0.42; Gemini: Gwet’s AC1 = 0.49).
AUC values were 0.89 for the expert, 0.86 for the fellow, and 0.81 for the resident; GPT and Gemini achieved AUCs of 0.85 and 0.84, respectively.
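
For readers unfamiliar with the agreement statistic named above, the following is a minimal sketch of the unweighted two-rater Gwet's AC1. It is an illustration only, not the study's analysis code: the function name `gwet_ac1` and the example scores are hypothetical, and the paper may have used a different software implementation or a weighted variant.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Unweighted Gwet's AC1 for two raters over a fixed set of categories.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")

    n = len(rater_a)
    k = len(categories)

    # Observed agreement: fraction of reports given the same score by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # pi_c: average of the two raters' marginal proportions for category c.
    counts = Counter(rater_a) + Counter(rater_b)
    pi = {c: counts[c] / (2 * n) for c in categories}

    # Chance agreement as defined for AC1.
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)

    return (p_a - p_e) / (1 - p_e)


# Made-up PI-RADS scores for eight reports (not data from the study).
expert = [1, 2, 3, 4, 5, 4, 3, 2]
model = [1, 2, 3, 4, 4, 4, 2, 2]
print(round(gwet_ac1(expert, model, categories=[1, 2, 3, 4, 5]), 3))
```

Unlike Cohen's kappa, the chance-agreement term here depends on the number of categories and the averaged marginals, which makes AC1 less sensitive to skewed score distributions.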

Abstract

Objective: This study aimed to evaluate the ability of GPT-4o and Gemini 2.5 Pro to extract and assign PI-RADS v2.1 scores from free-text prostate MRI reports, and to compare their performance with that of human readers of varied experience.

Methods: Three radiologists with differing levels of experience (resident, fellow, expert) independently reviewed the reports and assigned PI-RADS v2.1 scores. The same reports were processed through prompts with GPT-4o and Gemini 2.5 Pro. Inter-rater agreement was evaluated using Gwet’s AC1 coefficient, and diagnostic performance was assessed using sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).

Results: Among human readers, inter-rater agreement was highest between the expert and the fellow (Gwet’s AC1 = 0.68, 95% CI 0.61-0.75), which was significantly higher than the agreement between the two LLMs (Gwet’s AC1 = 0.52, 95% CI 0.44-0.59, P = 0.004). The agreement between the expert and GPT (Gwet’s AC1 = 0.42, 95% CI 0.34-0.51) was numerically lower than that between the expert and Gemini (Gwet’s AC1 = 0.49, 95% CI 0.41-0.57; P = 0.17). The AUCs for the resident, fellow, and expert readers were 0.81 (95% CI 0.76-0.87), 0.86 (95% CI 0.81-0.91), and 0.89 (95% CI 0.85-0.93), respectively; for GPT and Gemini they were 0.85 (95% CI 0.81-0.90) and 0.84 (95% CI 0.80-0.89), respectively.

Conclusion: LLMs demonstrated promising performance in assigning PI-RADS scores from free-text prostate MRI reports, with accuracy and agreement approaching those of general radiologists; however, they are not yet ready to replace expert interpretation in high-stakes clinical settings. Nevertheless, these findings support their potential as supplementary tools for report standardization and trainee education.
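
The diagnostic metrics named in the abstract can be sketched as follows. This is a hedged illustration, not the authors' method: the abstract does not state the reference standard or the score cutoff, so the binary `labels`, the threshold of 4, the function names, and the example data are all assumptions. The AUC is computed with the rank-based (Mann-Whitney) formulation, in which tied scores count as half.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when a score >= threshold is called positive.

    `labels` are assumed binary reference results (1 = positive, 0 = negative).
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Each positive/negative pair scores 1 for a win, 0.5 for a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: assigned PI-RADS scores and a hypothetical binary reference.
scores = [5, 4, 3, 2, 1, 4]
labels = [1, 1, 0, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=4)
print(sens, spec, auc(scores, labels))
```

Because PI-RADS scores take only five values, ties between positive and negative cases are common, which is why the half-credit handling matters for the AUC.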
