What question did this study set out to answer?

This study assesses breast oncologists' preferences among various large language models for treatment recommendation.

May 29, 2026

Physician preferences in large language models for breast cancer management.

Key Points

Evaluated six LLMs using 11 synthetic breast cancer vignettes representing clinical scenarios.
Conducted an electronic survey where oncologists selected superior treatment plans from randomized comparisons.
Estimated LLM performance using Elo ratings based on pairwise preferences.
ChatGPT-5 Thinking was preferred in 68.8% (11/16) of comparisons, with an Elo rating of 1573.3 (95% CI: 1486.2-1658.7).
Grok 4 was preferred in 66.7% (12/18) of comparisons, with an Elo rating of 1571.9 (95% CI: 1488.8-1649.8).
The remaining models had lower win rates (DeepSeek-V3 53.3%, ChatGPT-5 Fast 50.0%, OpenEvidence 38.1%); Claude Sonnet 4 ranked lowest at 25.0%, with an Elo rating of 1404.2 (95% CI: 1332.4-1486.1).
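The win rates above follow directly from the reported win counts, and the counts can be checked against the stated total of 49 head-to-head comparisons. The short Python sketch below does that arithmetic; the numbers are copied from the abstract's table, while the variable and function names are illustrative and not from the study.

```python
# Wins and appearances per model, copied from the abstract's table.
reported = {
    "ChatGPT-5 Thinking": (11, 16),
    "Grok 4": (12, 18),
    "DeepSeek-V3": (8, 15),
    "ChatGPT-5 Fast": (6, 12),
    "OpenEvidence": (8, 21),
    "Claude Sonnet 4": (4, 16),
}


def win_rate_percent(wins, appearances):
    """Win rate as a percentage, rounded to one decimal place."""
    return round(100 * wins / appearances, 1)


# Each comparison has exactly one winner and involves two models,
# so total wins should equal the number of comparisons and total
# appearances should be twice that.
total_wins = sum(wins for wins, _ in reported.values())
total_appearances = sum(n for _, n in reported.values())
win_rates = {m: win_rate_percent(w, n) for m, (w, n) in reported.items()}

print(total_wins, total_appearances)
for model, rate in win_rates.items():
    print(f"{model}: {rate}%")
```

The totals come out to 49 wins and 98 appearances, which matches 49 comparisons with two plans shown in each.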

Abstract

e12555

Background: As large language models (LLMs) are increasingly used as decision-support tools in breast oncology, their comparative performance in management plan guidance remains unclear. We evaluated whether breast medical oncologists perceive measurable quality differences across LLMs and which models are preferred.

Methods: We created 11 synthetic breast cancer vignettes representing common clinical scenarios with specified tumor characteristics, stage, prior therapy, performance status, and medical context. Six LLMs were queried August 10-15, 2025 for management plans: ChatGPT-5 Fast, ChatGPT-5 Thinking, Claude Sonnet 4, DeepSeek-V3, Grok 4, and OpenEvidence. In an electronic survey, breast medical oncologists at University of California, San Francisco and affiliated sites were shown two de-identified, randomized treatment plans per vignette and prompted to select the superior option. We estimated relative performance using Elo ratings (starting 1500; K=32), which summarize pairwise preferences where higher scores indicate greater preference probability. We calculated 95% confidence intervals using 1000 bootstrap resamples.

Results: Five oncologists completed 49 head-to-head comparisons across 11 vignettes. ChatGPT-5 Thinking and Grok achieved the highest Elo ratings and were preferred in 68.8% (11/16) and 66.7% (12/18) of comparisons, respectively. Remaining models: DeepSeek 53.3% (8/15), ChatGPT-5 Fast 50.0% (6/12), OpenEvidence 38.1% (8/21), and Claude 25.0% (4/16). Bootstrap 95% CIs supported this ranking. The rank order was the same using unbootstrapped Elo ratings.

Conclusions: Breast medical oncologists consistently preferred management plans from ChatGPT-5 Thinking and Grok, demonstrating measurable perceived quality differences across LLM-generated treatment recommendations.

Table: Large language model rankings by Elo rating for breast cancer management recommendations (in order of highest to lowest Elo rating and win rate).

Large Language Model    Elo Rating (Bootstrap Mean, 95% CI)    Win Rate
--------------------    -----------------------------------    ------------------
ChatGPT-5 Thinking      1573.3 (1486.2-1658.7)                 11/16 wins (68.8%)
Grok 4                  1571.9 (1488.8-1649.8)                 12/18 wins (66.7%)
DeepSeek-V3             1505.6 (1423.7-1587.6)                 8/15 wins (53.3%)
ChatGPT-5 Fast          1496.1 (1421.2-1568.6)                 6/12 wins (50.0%)
OpenEvidence            1448.9 (1362.9-1536.1)                 8/21 wins (38.1%)
Claude Sonnet 4         1404.2 (1332.4-1486.1)                 4/16 wins (25.0%)
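The Methods describe Elo ratings (start 1500, K=32) with 95% confidence intervals from 1000 bootstrap resamples. A minimal Python sketch of that kind of procedure is shown below. It is not the authors' code: the abstract does not publish the individual comparisons, the order in which they were processed, or the interval method, so the toy data (models "A", "B", "C"), the resampling of whole comparisons with replacement, and the percentile interval are all assumptions made for illustration.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(comparisons, start=1500.0, k=32.0):
    """Sequentially update ratings from (winner, loser) pairs."""
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain  # winner scored 1, expected less
        ratings[loser] = r_l - gain   # loser gives up the same amount
    return ratings


def bootstrap_elo(comparisons, n_resamples=1000, seed=0, start=1500.0):
    """Bootstrap mean and percentile 95% interval per model (assumed method)."""
    rng = random.Random(seed)
    models = sorted({m for pair in comparisons for m in pair})
    samples = {m: [] for m in models}
    for _ in range(n_resamples):
        # Resample whole comparisons with replacement, same sample size.
        resample = [rng.choice(comparisons) for _ in comparisons]
        ratings = elo_ratings(resample, start=start)
        for m in models:
            # A model missing from a resample keeps its starting rating.
            samples[m].append(ratings.get(m, start))
    summary = {}
    for m, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(0.975 * (len(values) - 1))]
        summary[m] = (sum(values) / len(values), lower, upper)
    return summary


# Hypothetical toy data: (winner, loser) pairs, NOT the study's comparisons.
toy = ([("A", "B")] * 6 + [("B", "A")] * 2 +
       [("A", "C")] * 5 + [("C", "A")] * 1 +
       [("B", "C")] * 4 + [("C", "B")] * 3)

for model, (mean, lower, upper) in bootstrap_elo(toy).items():
    print(f"{model}: {mean:.1f} ({lower:.1f}-{upper:.1f})")
```

Because sequential Elo depends on the order of comparisons, a different ordering or resampling scheme would shift the individual numbers, which is one reason a bootstrap mean is a useful summary alongside the single-pass rating.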


Cite This Study

Tsai et al. studied this question.

synapsesocial.com/papers/6a192f2dfab5b468c4418982
https://doi.org/10.1200/jco.2026.44.16_suppl.e12555
