What question did this study set out to answer?

This study assesses whether ChatGPT-5 can reliably support decision-making in gynecologic oncology tumor boards.

May 29, 2026

Feasibility and concordance of a large language model (ChatGPT-5) as a clinical decision support tool in gynecologic oncology tumor boards: A blinded, multi-observer study.

Key Points

The study compared ChatGPT-5 recommendations with multidisciplinary tumor board (MDT) decisions in 97 gynecologic cancer cases.
Two blinded medical oncologists rated the appropriateness of each recommendation on a 5-point Likert scale.
Reproducibility was assessed by querying each case at three time points on the same day; responses were fully consistent in only 38% of cases (37/97).
Inter-rater reliability was high for both MDT and AI evaluations (κ=0.748 and κ=0.802, respectively; both p<0.001).
Concordance between MDT and ChatGPT-5 recommendations was only fair (Rater 1: κ=0.267; Rater 2: κ=0.341).
ChatGPT-5 scored significantly lower than the MDT overall and was significantly inferior in cases requiring genetic testing recommendations, fertility-sparing approaches, or integration of novel therapeutics.
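
The κ values above are Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. The sketch below shows one way to compute it. It assumes the unweighted form (the abstract does not say whether a weighted variant was used for the ordinal Likert scores), and the function name and example ratings are hypothetical, not taken from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: unweighted kappa. The abstract does not state which variant
    the authors used.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: share of items on which both raters gave the same score.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each category, multiply the two raters' marginal
    # proportions, then sum over categories.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up Likert scores (1-5) for ten hypothetical cases, NOT data from the study.
rater_1 = [5, 4, 5, 3, 5, 4, 4, 5, 2, 5]
rater_2 = [5, 4, 4, 3, 5, 4, 5, 5, 2, 5]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.4f}")
```

In this toy example the raters agree on 8 of 10 cases, yet kappa is noticeably lower than 0.8 because both raters use the top score often, so much of the agreement could occur by chance. Under the commonly used Landis and Koch convention, values between 0.21 and 0.40 are labeled "fair", which matches the description of the MDT versus ChatGPT-5 concordance.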

Abstract

Abstract number: 5519

Background: The integration of artificial intelligence (AI) into oncology practice is accelerating; however, the reliability of large language models (LLMs) in complex clinical decision-making remains insufficiently validated. This study evaluated the concordance of ChatGPT-5 with multidisciplinary tumor board (MDT) decisions in gynecologic oncology, assessing accuracy, reproducibility across repeated queries, and specific domains of discordance.

Methods: We analyzed 97 gynecologic cancer cases requiring multimodal treatment planning, evaluated at the Cukurova University Faculty of Medicine Gynecologic Oncology MDT between 2024 and 2025. Cases included ovarian (n=34), endometrial (n=41), cervical (n=16), and rare tumors (n=6). Standardized clinical summaries (staging, pathology, molecular markers, comorbidities, patient preferences) were input using a structured prompt template. Each case was queried independently at three time points within the same day to assess reproducibility. Recommendations were evaluated by two blinded medical oncologists using a 5-point Likert scale (1=completely inappropriate, 5=completely appropriate). A composite performance score was calculated as (mean Likert score)/5×100. Inter-rater reliability and concordance were analyzed using Cohen's kappa (κ).

Results: Inter-rater reliability was high for both MDT (κ=0.748, p<0.001) and AI evaluations (κ=0.802, p<0.001). Concordance between MDT and ChatGPT-5 recommendations was fair (Rater 1: κ=0.267; Rater 2: κ=0.341), indicating frequent disagreement in specific clinical nuances despite similar overall quality scores. Mean performance scores were significantly higher for MDT versus AI (Rater 1: 94.2%±4.8 vs. 89.8%±6.3, p<0.001; Rater 2: 93.8%±5.1 vs. 90.1%±5.9, p<0.001). Crucially, ChatGPT-5 demonstrated full consistency across three queries in only 38% of cases (37/97), with a mean reproducibility score of 4.10±0.83. Subgroup analysis revealed superior AI performance in early-stage (I–II) versus advanced-stage (III–IV) disease (p=0.024). However, AI performance was significantly inferior in cases requiring genetic testing recommendations (p=0.019), fertility-sparing approaches (p=0.045), and integration of novel therapeutics (p=0.012).

Conclusions: ChatGPT-5 demonstrates potential as a clinical decision support tool but currently lacks sufficient reliability for independent use in complex gynecologic malignancies. Key limitations include inconsistent reproducibility (responses varied across the three queries in 62% of cases), suboptimal performance in advanced-stage disease, and deficiencies in precision oncology domains. Human expertise remains essential to mitigate risks associated with AI-generated inaccuracies, particularly for the integration of novel therapeutics.
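
The composite performance score and the reproducibility figures in the abstract are simple arithmetic, which the following sketch makes explicit. The function names are hypothetical, and the Likert list is made-up illustration data; only the counts 37 and 97 come from the abstract.

```python
def composite_score(likert_scores):
    """Composite performance score as defined in Methods: (mean Likert score) / 5 x 100."""
    if not likert_scores:
        raise ValueError("need at least one Likert score")
    mean_score = sum(likert_scores) / len(likert_scores)
    return mean_score / 5 * 100


def percent(part, total):
    """Share of `part` in `total`, expressed as a percentage."""
    return part / total * 100


# Made-up ratings for one hypothetical recommendation set, NOT study data.
example_score = composite_score([5, 4, 5, 4])  # mean 4.5 on the 5-point scale

# Counts reported in the abstract: 37 of 97 cases were fully consistent
# across the three same-day queries.
fully_consistent = percent(37, 97)
not_fully_consistent = percent(97 - 37, 97)

print(f"composite score: {example_score:.1f}%")
print(f"fully consistent: {fully_consistent:.1f}%")
print(f"some variation:   {not_fully_consistent:.1f}%")
```

Read in reverse, the same formula implies that the reported composite scores of roughly 90% and 94% correspond to mean Likert ratings of about 4.5 and 4.7 out of 5. The 62% figure quoted in the conclusions is the complement of the 38% of cases with fully consistent responses.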
