e12555

Background: As large language models (LLMs) are increasingly used as decision-support tools in breast oncology, their comparative performance in management plan guidance remains unclear. We evaluated whether breast medical oncologists perceive measurable quality differences across LLMs and which models are preferred.

Methods: We created 11 synthetic breast cancer vignettes representing common clinical scenarios with specified tumor characteristics, stage, prior therapy, performance status, and medical context. Six LLMs were queried August 10-15, 2025 for management plans: ChatGPT-5 Fast, ChatGPT-5 Thinking, Claude Sonnet 4, DeepSeek-V3, Grok 4, and OpenEvidence. In an electronic survey, breast medical oncologists at University of California, San Francisco and affiliated sites were shown two de-identified, randomized treatment plans per vignette and prompted to select the superior option. We estimated relative performance using Elo ratings (starting 1500; K=32), which summarize pairwise preferences where higher scores indicate greater preference probability. We calculated 95% confidence intervals using 1000 bootstrap resamples.

Results: Five oncologists completed 49 head-to-head comparisons across 11 vignettes. ChatGPT-5 Thinking and Grok achieved the highest Elo ratings and were preferred in 68.8% (11/16) and 66.7% (12/18) of comparisons, respectively. Remaining models: DeepSeek 53.3% (8/15), ChatGPT-5 Fast 50.0% (6/12), OpenEvidence 38.1% (8/21), and Claude 25.0% (4/16). Bootstrap 95% CIs supported this ranking. The rank order was the same using unbootstrapped Elo ratings.

Conclusions: Breast medical oncologists consistently preferred management plans from ChatGPT-5 Thinking and Grok, demonstrating measurable perceived quality differences across LLM-generated treatment recommendations.

Large language model rankings by Elo rating for breast cancer management recommendations.
| Large Language Model (in order of highest to lowest Elo rating and win rate) | Elo Rating (Bootstrap Mean, 95% CI) | Win Rate |
|---|---|---|
| ChatGPT-5 Thinking | 1573.3 (1486.2-1658.7) | 11/16 wins (68.8%) |
| Grok 4 | 1571.9 (1488.8-1649.8) | 12/18 wins (66.7%) |
| DeepSeek-V3 | 1505.6 (1423.7-1587.6) | 8/15 wins (53.3%) |
| ChatGPT-5 Fast | 1496.1 (1421.2-1568.6) | 6/12 wins (50.0%) |
| OpenEvidence | 1448.9 (1362.9-1536.1) | 8/21 wins (38.1%) |
| Claude Sonnet 4 | 1404.2 (1332.4-1486.1) | 4/16 wins (25.0%) |
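The rating procedure described in the Methods (Elo ratings starting at 1500 with K=32, and 95% confidence intervals from 1000 bootstrap resamples) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the standard logistic Elo expected-score formula on a 400-point scale, sequential updates in the order the comparisons are listed, resampling of individual comparisons with replacement, and percentile intervals; the abstract specifies none of these details. The function names and the toy comparison list (models "A", "B", "C") are hypothetical and are not the study data.

```python
import random

START, K = 1500.0, 32.0  # starting rating and K-factor stated in the abstract


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (assumed 400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(comparisons, start=START, k=K):
    """Update ratings sequentially from a list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        delta = k * (1.0 - expected_score(r_w, r_l))  # points moved to the winner
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def bootstrap_elo(comparisons, n_boot=1000, seed=0):
    """Return {model: (mean, lower, upper)} from percentile bootstrap over comparisons."""
    rng = random.Random(seed)
    models = sorted({m for pair in comparisons for m in pair})
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        # Resample comparisons with replacement (assumed resampling unit).
        resample = rng.choices(comparisons, k=len(comparisons))
        ratings = elo_ratings(resample)
        for m in models:
            # A model absent from a resample keeps its starting rating.
            samples[m].append(ratings.get(m, START))
    summary = {}
    for m, vals in samples.items():
        vals.sort()
        lower = vals[int(round(0.025 * (n_boot - 1)))]
        upper = vals[int(round(0.975 * (n_boot - 1)))]
        summary[m] = (sum(vals) / len(vals), lower, upper)
    return summary


# Hypothetical toy data: (winner, loser) pairs, NOT the survey responses.
toy = [("A", "B")] * 6 + [("A", "C")] * 6 + [("B", "C")] * 6 + [("C", "A")]

for model, (mean, lower, upper) in bootstrap_elo(toy).items():
    print(f"{model}: {mean:.1f} ({lower:.1f}-{upper:.1f})")
```

Because sequential Elo updates depend on the order of the comparisons, a different ordering (or averaging over shuffled orderings) would give somewhat different point estimates; the bootstrap mean reported in the table smooths over part of that variation.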
Tsai et al. (Thu,) studied this question.