Background: The reliability of large language models (LLMs) in complex oncologic decision-making remains inadequately validated. This study evaluated the concordance of ChatGPT-5 with multidisciplinary tumor board (MDT) decisions in gynecologic oncology, assessing accuracy, reproducibility, and domains of discordance.

Methods: We analyzed 242 gynecologic cancer cases (endometrial n = 102, ovarian n = 85, cervical n = 40, rare n = 15) discussed at the Çukurova University Gynecologic Oncology MDT (2024–2025). Standardized clinical summaries were input into ChatGPT-5 using a structured prompt template. Each case was queried three times within a single calendar day using independent conversations. Recommendations were evaluated by two blinded medical oncologists using a 5-point Likert scale. A composite performance score (CPS) was calculated as (mean Likert score / 5) × 100. Concordance was analyzed using Cohen’s kappa (κ).

Results: Inter-rater reliability was substantial to almost perfect for both MDT (κ = 0.761) and AI (κ = 0.814) evaluations (both p < 0.001). MDT–AI concordance was fair (Rater 1: κ = 0.258; Rater 2: κ = 0.334). CPS values were significantly higher for MDT than for AI (Rater 1: 93.8% ± 5.2 vs. 89.4% ± 6.7; Rater 2: 93.4% ± 5.5 vs. 89.7% ± 6.4; both p < 0.001). Full consistency across three queries was achieved in only 37.2% of cases (90/242). AI performance was significantly inferior in advanced-stage disease (p = 0.008), genetic testing (p = 0.006), fertility-sparing management (p = 0.018), and novel therapeutics (p = 0.003).

Conclusions: ChatGPT-5 demonstrates potential as a clinical decision support tool but lacks sufficient reliability for independent use. Key limitations include inconsistency in 62.8% of cases, suboptimal performance in advanced-stage disease, and deficiencies in precision oncology domains. These findings suggest that human expertise remains indispensable for the individualized management of complex gynecologic malignancies.
Asoglu et al. studied this question.
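
The three quantities named in the Methods (CPS, Cohen's κ, and full consistency across repeated queries) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. It assumes unweighted Cohen's κ (the abstract does not say whether a weighted variant was used) and assumes that "full consistency" means all three responses for a case map to the same categorical recommendation. The function names and the example data are hypothetical.

```python
from collections import Counter


def composite_performance_score(likert_scores):
    """CPS as defined in the abstract: (mean Likert score / 5) * 100."""
    if not likert_scores:
        raise ValueError("at least one Likert score is required")
    mean_score = sum(likert_scores) / len(likert_scores)
    return mean_score / 5 * 100


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences.

    Assumption: the study's exact kappa variant is not stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal proportions, summed over categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if expected == 1.0:
        # Both sequences use one identical category; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def consistency_rate(responses_per_case):
    """Fraction of cases whose repeated responses are all identical.

    Assumption: each response has already been reduced to a category label.
    """
    if not responses_per_case:
        raise ValueError("at least one case is required")
    consistent = sum(len(set(r)) == 1 for r in responses_per_case)
    return consistent / len(responses_per_case)
```

For instance, `composite_performance_score([5, 4, 5])` gives roughly 93.3, and the reported consistency figure follows from `90 / 242`, which rounds to 37.2%.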