What question did this study set out to answer?

This study aims to evaluate the reliability of ChatGPT-5 in supporting clinical decisions in gynecologic oncology tumor boards by assessing its concordance with human experts.

June 11, 2026 · Open Access

Feasibility and Concordance of a Large Language Model (ChatGPT-5) as a Clinical Decision Support Tool in Gynecologic Oncology Tumor Boards: A Blinded, Multi-Observer Study

Key Points

This study aims to evaluate the reliability of ChatGPT-5 in supporting clinical decisions in gynecologic oncology tumor boards by assessing its concordance with human experts.
Analyzed 242 gynecologic cancer cases discussed at the Çukurova University Gynecologic Oncology MDT (2024–2025).
Entered standardized clinical summaries into ChatGPT-5 using a structured prompt template; two blinded medical oncologists evaluated the recommendations.
Calculated concordance and inter-rater reliability using Cohen’s kappa (κ) and scored performance on a 5-point Likert scale.
Inter-rater reliability was substantial to almost perfect for both MDT (κ = 0.761) and AI (κ = 0.814) evaluations (both p < 0.001).
MDT–AI concordance was only fair (Rater 1: κ = 0.258; Rater 2: κ = 0.334).
AI performance was significantly inferior in advanced-stage disease (p = 0.008), genetic testing (p = 0.006), fertility-sparing (p = 0.018), and novel therapeutics (p = 0.003).
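
For readers unfamiliar with the statistic behind these points, the following is a minimal sketch of unweighted Cohen's kappa for two raters, with the conventional Landis and Koch descriptors ("fair", "substantial", "almost perfect") that match the wording used here. The function names and the toy ratings are hypothetical, and the study does not state whether it used an unweighted or weighted kappa or how the Likert ratings were categorized, so this should be read as an illustration rather than the authors' analysis.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of cases.")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical toy ratings (1 = acceptable, 0 = not acceptable), not study data.
toy_rater_1 = [1, 1, 0, 0]
toy_rater_2 = [1, 0, 0, 0]
print(cohens_kappa(toy_rater_1, toy_rater_2))

# Descriptors for the kappa values reported in the Key Points.
for value in (0.761, 0.814, 0.258, 0.334):
    print(value, landis_koch_label(value))
```

Under these bands, the reported inter-rater values (0.761 and 0.814) fall in the "substantial" and "almost perfect" ranges, while both MDT–AI values (0.258 and 0.334) fall in the "fair" range.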

Abstract

Background: The reliability of large language models (LLMs) in complex oncologic decision-making remains inadequately validated. This study evaluated the concordance of ChatGPT-5 with multidisciplinary tumor board (MDT) decisions in gynecologic oncology, assessing accuracy, reproducibility, and domains of discordance.

Methods: We analyzed 242 gynecologic cancer cases (endometrial n = 102, ovarian n = 85, cervical n = 40, rare n = 15) discussed at the Çukurova University Gynecologic Oncology MDT (2024–2025). Standardized clinical summaries were input into ChatGPT-5 using a structured prompt template. Each case was queried three times within a single calendar day using independent conversations. Recommendations were evaluated by two blinded medical oncologists using a 5-point Likert scale. A composite performance score (CPS) was calculated as (mean Likert/5) × 100. Concordance was analyzed using Cohen’s kappa (κ).

Results: Inter-rater reliability was substantial to almost perfect for both MDT (κ = 0.761) and AI (κ = 0.814) evaluations (both p < 0.001). MDT–AI concordance was fair (Rater 1: κ = 0.258; Rater 2: κ = 0.334). CPS were significantly higher for MDT versus AI (Rater 1: 93.8% ± 5.2 vs. 89.4% ± 6.7; Rater 2: 93.4% ± 5.5 vs. 89.7% ± 6.4; both p < 0.001). Full consistency across three queries was achieved in only 37.2% of cases (90/242). AI performance was significantly inferior in advanced-stage disease (p = 0.008), genetic testing (p = 0.006), fertility-sparing (p = 0.018), and novel therapeutics (p = 0.003).

Conclusions: ChatGPT-5 demonstrates potential as a clinical decision support tool but lacks sufficient reliability for independent use. Key limitations include inconsistency in 62.8% of cases, suboptimal performance in advanced-stage disease, and deficiencies in precision oncology domains. These findings suggest that human expertise remains indispensable for the individualized management of complex gynecologic malignancies.
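
The two simplest quantities in the abstract, the composite performance score and the share of cases with full consistency across the three queries, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the toy inputs are not study data, and treating "full consistency" as three identical recommendation labels is an assumption, since the abstract does not define how consistency was judged.

```python
def composite_performance_score(likert_scores, scale_max=5):
    """CPS as stated in the abstract: (mean Likert / 5) x 100."""
    if not likert_scores:
        raise ValueError("At least one Likert score is required.")
    mean_score = sum(likert_scores) / len(likert_scores)
    # Algebraically the same as (mean / scale_max) * 100.
    return 100 * mean_score / scale_max


def full_consistency_rate(runs_per_case):
    """Percentage of cases whose repeated queries all gave the same label.

    Assumption: each case is a list of recommendation labels, one per query,
    and 'consistent' means all labels are identical.
    """
    if not runs_per_case:
        raise ValueError("At least one case is required.")
    consistent = sum(1 for runs in runs_per_case if len(set(runs)) == 1)
    return 100 * consistent / len(runs_per_case)


# Hypothetical Likert ratings for one recommendation set (not study data).
print(composite_performance_score([5, 4, 5, 4]))

# Hypothetical labels for two cases, three queries each (not study data).
toy_cases = [["surgery", "surgery", "surgery"],
             ["surgery", "chemotherapy", "surgery"]]
print(full_consistency_rate(toy_cases))

# Arithmetic check of the reported figure: 90 consistent cases out of 242.
print(round(100 * 90 / 242, 1))
```

The last line reproduces the reported 37.2%, which leaves 152 of 242 cases (62.8%) with at least one differing response across the three queries.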


Cite This Study

Asoglu et al. studied this question.

synapsesocial.com/papers/6a2a515980c8f91e7f39d9b3 https://doi.org/10.3390/jcm15124451