What type of study is this?

This is a validation study.

What question did this study set out to answer?

To evaluate ChatGPT's ability to provide evidence-based responses aligned with clinical guidelines for breast cancer.

March 23, 2026 | Open Access

The potential of ChatGPT as an artificial intelligence enhancement therapy consultant for patients with breast cancer

Key Points

To evaluate ChatGPT's ability to provide evidence-based responses aligned with clinical guidelines for breast cancer.
Designed tests to assess ChatGPT responses to breast cancer-related questions.
Administered 30 validated questions across three iterative trials to GPT-3.5 and GPT-4.0.
Compared responses against those from breast surgeons of varying expertise levels.
GPT-4.0 outperformed GPT-3.5 in most metrics for breast cancer questions.
Both GPT models matched the responses of less qualified breast surgeons.
Statistical analysis revealed no significant difference in mean scores for most items.

Abstract

Background: OpenAI developed ChatGPT as an advanced artificial intelligence (AI)-driven natural language processing system. ChatGPT generates responses through statistical pattern recognition established during pretraining.

Objective: To ascertain whether ChatGPT could respond to patients with breast cancer in a way that was consistent with evidence-based medical practice and a breast cancer clinical guideline. The guideline was a practical pocket book based on the latest evidence and on national data. A further aim was to evaluate the ability of AI to provide accurate and up-to-date information to patients, potentially serving as a supplementary resource for medical professionals.

Methods: The research team designed a series of tests to assess ChatGPT's responses to specific questions about breast cancer diagnosis, treatment options, and post-treatment care. Thirty clinically validated breast cancer questions spanning diagnosis, prognosis, treatment, and pharmacotherapy were administered in three iterative trials to (1) GPT-3.5 and GPT-4.0, with a 5-minute interval between trials, and (2) three breast surgeons stratified by expertise (high, medium, and low). Each response was scored dichotomously against the standard answer (1 = guideline-consistent; 0 = inconsistent), and the three scores for each question were summed to give a total of 0 to 3. Data analysis included mean score comparisons (analysis of variance with post hoc Tukey tests), subgroup analyses by question category, and inter-rater reliability assessment.

Results: A performance comparison between GPT-3.5 and GPT-4.0 across breast surgery subspecialties and question types showed that GPT-4.0 generally outperformed GPT-3.5, although the difference in mean scores was not significant for most items. GPT-3.5 showed the same medical response ability as lower-qualified breast surgeons, while GPT-4.0 showed the same ability as higher-qualified breast surgeons.
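
The scoring scheme and mean-score comparison described in the Methods can be illustrated with a minimal Python sketch. It assumes each question receives exactly three dichotomous marks that are summed to a 0 to 3 score, and it computes only the one-way analysis of variance F statistic. The post hoc Tukey test and the p-value are omitted. The function names (`question_score`, `one_way_anova_f`) and the marks are hypothetical and are not data or code from the study.

```python
from statistics import mean


def question_score(marks):
    """Sum three dichotomous marks (1 = guideline-consistent, 0 = inconsistent)."""
    if len(marks) != 3 or any(m not in (0, 1) for m in marks):
        raise ValueError("expected exactly three marks, each 0 or 1")
    return sum(marks)


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of per-group score lists."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical marks for four questions per responder; NOT data from the study.
hypothetical_marks = {
    "GPT-3.5": [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 1, 0)],
    "GPT-4.0": [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 1, 1)],
}

scores = {
    name: [question_score(m) for m in marks]
    for name, marks in hypothetical_marks.items()
}
f_stat = one_way_anova_f(list(scores.values()))

for name, per_question in scores.items():
    print(name, per_question, "mean =", mean(per_question))
print("F statistic =", round(f_stat, 4))
```

Judging whether a difference in mean scores is significant would additionally require comparing the F statistic with the F distribution for (k - 1, n - k) degrees of freedom, which a statistics package would normally handle.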
