What question did this study set out to answer?

The study aims to evaluate the concordance between large language models and molecular tumor board decisions in precision oncology.

May 29, 2026

The AI paradox in precision oncology: Prospective blinded validation of large language models against molecular tumor board.

Key Points

The study aims to evaluate the concordance between large language models and molecular tumor board decisions in precision oncology.
Prospective, blinded, cross-sectional validation study of 108 cases discussed at a national molecular tumor board.
Analyzed anonymized clinical and genomic data using four large language models, each queried in two independent runs.
Concordance categorized as concordant, discordant, AI non-evaluable, or MTB non-evaluable.
End-to-end concordance was 88%, 74%, 83%, and 55% for the four LLMs respectively.
Citation-level hallucination rates were high, ranging from 36% to 49% across models.
Higher evidence level significantly predicted concordance for all models on univariate analysis (all P ≤ 0.03); ESCAT tier was also associated with concordance for three of the four models.
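
The four outcome categories can be turned into concordance rates in more than one way. The abstract reports an end-to-end rate as the primary endpoint and a conditional rate, excluding non-evaluable outputs, as secondary. The sketch below is one plausible reading, not the authors' code. It assumes that end-to-end concordance counts AI non-evaluable cases as failures and that MTB non-evaluable cases are dropped from both denominators. The function name, label strings, and example labels are hypothetical.

```python
from collections import Counter

# Hypothetical label strings for the four categories named in the study.
CONCORDANT = "concordant"
DISCORDANT = "discordant"
AI_NON_EVALUABLE = "ai_non_evaluable"
MTB_NON_EVALUABLE = "mtb_non_evaluable"


def concordance_rates(labels):
    """Return (end_to_end, conditional) concordance for one model.

    Assumed definitions (not confirmed by the source):
      end_to_end  = concordant / all cases with an evaluable MTB decision
      conditional = concordant / (concordant + discordant)
    """
    counts = Counter(labels)
    comparable = len(labels) - counts[MTB_NON_EVALUABLE]
    judged = counts[CONCORDANT] + counts[DISCORDANT]
    end_to_end = counts[CONCORDANT] / comparable if comparable else float("nan")
    conditional = counts[CONCORDANT] / judged if judged else float("nan")
    return end_to_end, conditional


# Made-up labels for illustration only; these are not study data.
example_labels = (
    [CONCORDANT] * 6
    + [DISCORDANT] * 2
    + [AI_NON_EVALUABLE] * 2
    + [MTB_NON_EVALUABLE] * 2
)
end_to_end, conditional = concordance_rates(example_labels)
print(f"end-to-end: {end_to_end:.2f}, conditional: {conditional:.2f}")
```

Under these assumptions the conditional rate can never be lower than the end-to-end rate, and the gap between them grows with the share of AI non-evaluable outputs.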

Abstract

11047

Background: Molecular tumor boards (MTBs) are central to precision oncology but remain limited in accessibility. Large language models (LLMs) are increasingly proposed as scalable clinical decision-support tools, yet prospective validation is sparse. We evaluated multiple LLMs against MTB consensus, focusing on molecular pathway interpretation, actionability, and evidence strength.

Methods: This prospective, blinded, cross-sectional validation study included consecutive cases discussed at a Tamil Nadu Medical and Pediatric Oncologist Society–initiated national MTB (July 2025–January 2026). Anonymized clinical and genomic data were analyzed using a standardized prompt across the latest versions of 4 LLMs: ChatGPT (5, 5.1, 5.2), Perplexity, Gemini (2.5 Flash, 3 Pro), and DeepSeek, each queried in 2 independent runs to assess reproducibility. AI systems were blinded to MTB decisions, and reviewers were blinded to AI identity. Concordance was classified as concordant, discordant, AI non-evaluable (extraction failure or non-reproducible output), or MTB non-evaluable (no predominant molecular pathway). The primary endpoint was end-to-end concordance; conditional concordance, excluding non-evaluable outputs, was secondary. Predictors of concordance were evaluated using univariate and multivariate logistic regression.

Results: Of 108 cases, 80 were evaluable for AI comparison. Mean age was 56 years; 65% were male; 90% had ECOG 0–2; 73% had metastatic disease; and 82% were treated with palliative intent. The cohort was heavily pretreated, with 48% receiving ≥3 prior lines. Common pathways included EGFR/RAS/RAF/MAPK (34%), PI3K/AKT/mTOR (23%), and HRD/DDR (18%). End-to-end concordance was 88%, 74%, 83%, and 55% across the four LLMs. AI non-evaluable outputs, due to extraction failure or non-reproducibility across the two independent runs, occurred in 3.8% to 21.3% of cases across platforms, highlighting important limitations in reliability. Citation-level hallucination rates were high (41%, 36%, 49%, and 48%). On univariate analysis, higher evidence level predicted concordance across all models (all P ≤ 0.03). ESCAT tier was also significantly associated with concordance for LLM-2, LLM-3, and LLM-4. On multivariate analysis, evidence level remained the only consistent independent predictor for three LLMs, while ESCAT tier retained significance for one model. Inter-rater reliability was excellent (96.3%; κ = 0.93).

Conclusions: LLM concordance is highest in guideline-supported, high-actionability settings but declines in low-evidence, poor-targetability scenarios, precisely where clinical support is most needed. This “AI paradox,” combined with substantial hallucination risk, indicates that while LLMs may assist first-pass molecular interpretation, expert multidisciplinary MTBs remain essential for safe and reliable precision oncology.
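
The inter-rater result pairs a raw agreement percentage with a kappa statistic, which corrects agreement for chance. The abstract does not say which kappa variant was used, so the sketch below assumes two reviewers and Cohen's kappa. The function name and the ratings are hypothetical and are not study data.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return p_o, float("nan")
    return p_o, (p_o - p_e) / (1 - p_e)


# Made-up ratings: "c" = concordant, "d" = discordant.
reviewer_1 = ["c", "c", "c", "d", "d", "c", "d", "c", "c", "d"]
reviewer_2 = ["c", "c", "c", "d", "d", "c", "d", "c", "d", "d"]
p_o, kappa = agreement_and_kappa(reviewer_1, reviewer_2)
print(f"agreement: {p_o:.2f}, kappa: {kappa:.2f}")
```

Kappa falls below raw agreement whenever some agreement would be expected by chance, which is why the two reported figures (96.3% and 0.93) differ.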
