11047 Background: Molecular tumor boards (MTBs) are central to precision oncology but remain limited in accessibility. Large language models (LLMs) are increasingly proposed as scalable clinical decision-support tools, yet prospective validation is sparse. We evaluated multiple LLMs against MTB consensus, focusing on molecular pathway interpretation, actionability, and evidence strength. Methods: This prospective, blinded, cross-sectional validation study included consecutive cases discussed at a national MTB initiated by the Tamil Nadu Medical and Pediatric Oncologist Society (July 2025–January 2026). Anonymized clinical and genomic data were analyzed using a standardized prompt across the latest versions of 4 LLMs: ChatGPT (5, 5.1, 5.2), Perplexity, Gemini (2.5 Flash, 3 Pro), and DeepSeek. Each was queried in 2 independent runs to assess reproducibility; AI systems were blinded to MTB decisions, and reviewers were blinded to AI identity. Each output was classified as concordant, discordant, AI non-evaluable (extraction failure or non-reproducible), or MTB non-evaluable (no predominant molecular pathway). The primary endpoint was end-to-end concordance; conditional concordance, which excludes non-evaluable outputs, was secondary. Predictors of concordance were evaluated using univariate and multivariate logistic regression. Results: Of 108 cases, 80 were evaluable for AI comparison. Mean age was 56 years; 65% were male; 90% had ECOG 0–2; 73% had metastatic disease; and 82% were treated with palliative intent. The cohort was heavily pretreated, with 48% receiving ≥3 prior lines. Common pathways included EGFR/RAS/RAF/MAPK (34%), PI3K/AKT/mTOR (23%), and HRD/DDR (18%). End-to-end concordance was 88%, 74%, 83%, and 55% across the four LLMs. AI non-evaluable outputs (extraction failure or non-reproducibility across the 2 independent runs) occurred in 3.8% to 21.3% of cases, depending on the platform, highlighting important limitations in reliability.
Citation-level hallucination rates were high (41%, 36%, 49%, and 48%). On univariate analysis, higher evidence level predicted concordance across all models (all P ≤ 0.03). ESCAT tier was also significantly associated with concordance for LLM-2, LLM-3, and LLM-4. On multivariate analysis, evidence level remained the only consistent independent predictor for three LLMs, while ESCAT tier retained significance for one model. Inter-rater reliability was excellent (96.3%; κ = 0.93). Conclusions: LLM concordance is highest in guideline-supported, high-actionability settings but declines in low-evidence, poor-targetability scenarios—precisely where clinical support is most needed. This “AI paradox,” combined with substantial hallucination risk, indicates that while LLMs may assist first-pass molecular interpretation, expert multidisciplinary MTBs remain essential for safe and reliable precision oncology.
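The two concordance endpoints and the inter-rater kappa described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names, the category labels, and the toy data are hypothetical, and the denominators are an assumption: end-to-end concordance is taken over all cases except MTB non-evaluable ones (so AI non-evaluable outputs count against the model), while conditional concordance is taken over concordant plus discordant cases only.

```python
from collections import Counter

# Hypothetical labels for the four outcome classes named in the abstract.
CATEGORIES = {"concordant", "discordant", "ai_non_evaluable", "mtb_non_evaluable"}


def concordance_rates(labels):
    """Return (end_to_end, conditional) concordance for one model.

    Assumed denominators (not confirmed by the abstract):
      end_to_end  = concordant / (all cases minus MTB non-evaluable)
      conditional = concordant / (concordant + discordant)
    """
    counts = Counter(labels)
    unknown = set(counts) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")

    comparable = len(labels) - counts["mtb_non_evaluable"]
    evaluable = counts["concordant"] + counts["discordant"]
    end_to_end = counts["concordant"] / comparable if comparable else float("nan")
    conditional = counts["concordant"] / evaluable if evaluable else float("nan")
    return end_to_end, conditional


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Toy data for illustration only; these are not the study's counts.
toy_labels = (
    ["concordant"] * 6
    + ["discordant"] * 2
    + ["ai_non_evaluable"]
    + ["mtb_non_evaluable"]
)
end_to_end, conditional = concordance_rates(toy_labels)

toy_rater_a = ["concordant", "concordant", "discordant", "discordant"]
toy_rater_b = ["concordant", "concordant", "discordant", "concordant"]
kappa = cohen_kappa(toy_rater_a, toy_rater_b)
```

Under these assumptions, conditional concordance can never be lower than end-to-end concordance for the same model, because removing AI non-evaluable outputs only shrinks the denominator. The gap between the two is a direct measure of how much unreliable output costs each platform.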
Seshachalam et al. studied this question.