e13664 Background: Large language models (LLMs) hold significant promise as a support tool in oncology decision-making. However, large-scale studies measuring LLMs’ performance in this area are still lacking. Methods: At an Italian annual review course in medical oncology (www.grandangolo.org), a digital audience-response system was used to let attendees vote on multiple-choice questions about challenging clinical cases covering 8 major cancers. Each cancer-specific session was chaired by a faculty of internationally recognized experts. Four independent reviewers captured the faculty’s discussion to generate the reference database of answers, which was later used to compare responses from the audience and from ChatGPT (OpenAI, GPT-5.2 Thinking, Jan 2026). ChatGPT was zero-shot prompted to generate the most appropriate answer, to rank the acceptable answers if more than one was acceptable, and to identify the wrong answers, if any. The reviewers identified the faculty’s and ChatGPT’s preferred 1st and 2nd choices, other acceptable choices, and wrong answers. Results: Among the 550 attendees, the median number of oncologists who voted was 158 (range 86-258). A total of 51 of 53 consecutively presented multiple-choice clinical cases, comprising 210 answer options, were fully addressed and voted on by the audience. Across cases, the median percentage of oncologists whose selected answer matched the faculty’s 1st or 1st/2nd choices was 60% (range 6-97%) and 82% (range 7-99%), respectively. ChatGPT’s responses matched the faculty’s 1st or 1st/2nd choices in 57% (29/51) and 75% (38/51) of cases, respectively. The median percentage of oncologists selecting a wrong answer was 7% (range 0-93%), while ChatGPT selected a wrong answer in 16% (8/51) of cases. These 8 cases were among those most frequently missed by the audience (wrong response rate, range 25-93%); the reviewers attributed these errors to unclear case descriptions or drug reimbursement issues.
Given that most cases had multiple acceptable answers, the agreement between the audience’s votes and ChatGPT’s output was also explored. ChatGPT’s preferred choice was among the audience’s 2 most-voted answers in 90% (46/51) of cases. Conclusions: These results indicate high agreement between ChatGPT’s answers to a large number of challenging cases in medical oncology and those provided by a faculty of internationally recognized experts.
Gottlieb et al. (Thu,) studied this question.
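As a reading aid, the case-level proportions reported for ChatGPT can be recomputed from the stated counts. The following is a minimal sketch, not the authors' analysis code: the counts are copied from the abstract, while the names `TOTAL_CASES`, `reported_counts`, and `percent` are hypothetical and chosen only for illustration. It assumes the percentages were rounded to the nearest integer.

```python
# Sketch only: recompute the ChatGPT case-level percentages from the
# counts stated in the abstract. All names here are illustrative.

TOTAL_CASES = 51  # cases fully addressed and voted on, as reported

# Numerators reported in the abstract, each out of 51 cases.
reported_counts = {
    "ChatGPT matched faculty 1st choice": 29,
    "ChatGPT matched faculty 1st or 2nd choice": 38,
    "ChatGPT selected a wrong answer": 8,
    "ChatGPT choice among audience's 2 most-voted answers": 46,
}


def percent(count, total=TOTAL_CASES):
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


# Assumption: the abstract rounds to whole percentages.
rates = {label: percent(n) for label, n in reported_counts.items()}

for label, pct in rates.items():
    print(f"{label}: {reported_counts[label]}/{TOTAL_CASES} = {pct}%")
```

Under that rounding assumption, the recomputed values agree with the 57%, 75%, 16%, and 90% figures given in the Results.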