ChatGPT (GPT-4.1) was numerically but not statistically superior to oncologists in per-patient binary accuracy for survival prediction (25.4% vs 20.0% win rate; p=0.299).
Observational (n=205)
No
Does ChatGPT improve prognostic accuracy compared to oncologists and a SEER-based calculator in adult cancer patients?
ChatGPT demonstrated comparable or superior prognostic estimates for clinically relevant time points compared to oncologists, particularly in long-term outcomes for advanced disease.
Absolute Event Rate: 25.4% vs 20%
p-value: p=0.299
1621 Background: Prognostic estimates guide cancer treatment planning and goals-of-care discussions. Clinicians often rely on population-based survival statistics (e.g., SEER), which may not reflect individualized risk. Large language models (LLMs) such as ChatGPT may offer more personalized estimates, but their performance relative to oncologists and population-based tools remains unclear. Methods: We conducted a retrospective comparative study of 205 adult cancer patients treated in a safety-net cancer clinic. For each patient, one deidentified clinical note from the time of diagnosis was provided to a HIPAA-compliant instance of ChatGPT (GPT-4.1) and to an oncologist unfamiliar with the patient. Both generated binary (alive/deceased) and probabilistic (0–100%) predictions of survival at 6 months and at 1, 2, and 5 years. The primary endpoint was the comparison of per-patient binary accuracy (0–4 correct timepoints) between ChatGPT and the oncologist. Secondary outcomes (n=189) included Brier scores and calibration metrics, compared with oncologists and a publicly available SEER-based cancer-specific survival calculator (CancerSurvivalRates.com), as well as subgroup analyses by cancer stage. Significance testing used exact binomial methods for per-patient win–loss comparisons and paired nonparametric tests to compare probabilistic performance across methods. Results: Of the 205 patients, 25, 53, 53, and 74 were stages I, II, III, and IV, respectively. Gender was balanced (53% male). All surviving patients had at least 5 years of follow-up. Notes varied in final staging and treatment plans, as some patients were in their initial evaluation. In the primary analysis (n = 205), ChatGPT was numerically but not statistically superior to the oncologist (52 vs. 41 per-patient wins, p=0.299). In secondary analyses (n = 189), ChatGPT had superior overall accuracy, with lower Brier scores at 1 year (p < 0.001) and 2 years (p = 0.030).
Calibration analyses showed that at 5 years, ChatGPT achieved near-ideal reliability (calibration slope 1.018), whereas oncologists demonstrated overconfidence (slope 0.535). As expected, cancer-specific survival estimates from CancerSurvivalRates.com (CSR) were significantly higher than the overall survival (OS) estimates from oncologists or ChatGPT. Stage-stratified analyses revealed oncologist superiority in Stage I disease (p = 0.036), while ChatGPT significantly outperformed oncologists in Stage IV disease at the 2- and 5-year horizons (both p < 0.001). Conclusions: In a safety-net cancer clinic, using one unstructured note, ChatGPT demonstrated comparable or superior prognostic estimates at clinically relevant time points compared with oncologists, particularly for long-term outcomes in advanced disease. Future studies should evaluate cancer-specific survival and prognosis during or after treatment.
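The abstract's two headline metrics can be illustrated with a short sketch. It assumes the "exact binomial methods" amount to a two-sided sign test on the 93 patients where one predictor beat the other (52 ChatGPT wins, 41 oncologist wins), with ties excluded; the abstract does not spell this out. The function names are hypothetical, and the Brier score inputs are made-up toy values, not study data.

```python
from math import comb


def exact_binomial_two_sided(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial (sign) test under the null of equal win chance.

    Assumption: ties are dropped, so only the wins_a + wins_b discordant
    patients enter the test.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # With p = 0.5 the binomial distribution is symmetric, so the two-sided
    # p-value is twice the upper tail P(X >= k), capped at 1.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted survival probabilities (0-1)
    and observed outcomes (1 = alive, 0 = deceased). Lower is better."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Win counts reported in the abstract's primary analysis (n = 205).
p_value = exact_binomial_two_sided(52, 41)
print(f"win rates: {52 / 205:.1%} vs {41 / 205:.1%}, p = {p_value:.3f}")

# Toy example only: four hypothetical patients, not data from the study.
toy_probs = [0.9, 0.6, 0.2, 0.1]
toy_outcomes = [1, 1, 0, 1]
print(f"toy Brier score: {brier_score(toy_probs, toy_outcomes):.3f}")
```

Under these assumptions the sign test gives a p-value of roughly 0.30, in line with the reported p=0.299. The win rates use all 205 patients as the denominator, which is how 52 and 41 wins map to 25.4% and 20.0%.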
Huang et al. (Wed,) conducted an observational study in cancer patients (n=205). ChatGPT (GPT-4.1) was compared with an oncologist and a SEER-based survival calculator on a per-patient binary accuracy (0-4 correct timepoints) win-loss comparison (p=0.299). ChatGPT (GPT-4.1) was numerically but not statistically superior to oncologists in per-patient binary accuracy for survival prediction (25.4% vs 20.0% win rate; p=0.299).