Abstract

Background: KRAS activating mutations are the most prevalent oncogenic drivers in non-small cell lung cancer (NSCLC), with KRAS G12C as the dominant subtype. FDA-approved small-molecule inhibitors (sotorasib and adagrasib) have improved outcomes in KRAS G12C-positive patients, though responses remain heterogeneous. To dissect clinical and molecular determinants of treatment response and overall survival, we analyzed large KRAS-mutant NSCLC cohorts with multimodal clinical annotation and targeted sequencing, integrating computational and machine learning (ML) approaches.

Methods: A total of 679 KRAS-mutant NSCLC patients, including 43 treated with KRAS inhibitors, were retrieved from the UCSF Information Commons (data released in March 2025). Clinical, pathological, targeted tumor exome sequencing, survival, and treatment-response data were extracted from clinical notes using SQL-based algorithms and large language models (LLMs). Structured datasets were generated for ML models including random forest (RF), multilayer perceptron (MLP), and XGBoost (XGB). An independent MSK-IMPACT cohort (n=2,152) was used for validation. All computational analyses were performed on the UCSF Wynton HPC cluster.

Results: Comparing co-occurring mutations between UCSF KRAS G12C (n=185) and non-G12C KRAS (n=494) treatment-naïve tumors, we identified 14 mutations with significantly different frequencies (Fisher's exact test, BH-corrected p < 0.05). LRP1B, KEAP1, RUNX1T1, ATP6AP1, ZFTA, TRAF7, and NF2 gene mutations were enriched in the KRAS G12C cohort. Clinico-genetic data predicted KRAS inhibitor treatment outcome (TO; stable vs. progressive disease) and overall survival (OS) with modest accuracy using RF (TO: 0.67, OS: 0.64), MLP (TO: 0.83, OS: 0.73), and XGB (TO: 0.67, OS: 0.73) models.
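The differential co-mutation comparison described above (per-gene Fisher's exact test followed by Benjamini-Hochberg correction) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the gene labels, and the mutation counts are hypothetical, and only the group sizes (185 G12C, 494 non-G12C) come from the abstract. The test is implemented in pure Python so the example stays self-contained.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mutated tumors in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_mutations(counts, n_g12c=185, n_other=494, alpha=0.05):
    """Return {gene: adjusted p} for genes passing the BH threshold.

    counts maps gene -> (mutated in G12C, mutated in non-G12C); hypothetical input.
    """
    genes = list(counts)
    pvals = [
        fisher_exact_two_sided(g12c, n_g12c - g12c, other, n_other - other)
        for g12c, other in (counts[g] for g in genes)
    ]
    qvals = benjamini_hochberg(pvals)
    return {g: q for g, q in zip(genes, qvals) if q < alpha}


# Hypothetical counts for illustration only; not data from the study.
example = differential_mutations({"GENE_A": (60, 60), "GENE_B": (20, 53)})
```

In practice, a library routine such as `scipy.stats.fisher_exact` would likely be used instead of the hand-written test; the pure-Python version is shown only to make the calculation explicit.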
Intriguingly, the models identified co-mutations of NAV3, COL2A1, MLH3, PTPRD, and SOS2 as associated with stable disease and co-mutations of IGFBP3, SPEN, and PTPRB as associated with progressive disease, whereas co-mutations of DUSP4, WHSC1, EBF1, and MAP2K4 were potential covariates for poor OS. ML prediction of OS in the MSK-IMPACT cohort using clinico-genetic test data achieved AUROC values of 0.80 (RF), 0.72 (MLP), and 0.81 (XGB). Feature-importance analyses highlighted metastasis, tumor mutation burden (TMB), and mutations in ERCC5, BCOR, and SPEN as major predictors of poor OS.

Conclusion: Analysis of real-world multimodal clinical data revealed distinct biological and genomic features between KRAS G12C and non-G12C NSCLC, as well as key determinants of survival and treatment response. This study demonstrates the value of LLM-generated structured clinical data for AI/ML-driven oncology research and hypothesis generation, and highlights its potential to improve personalized treatment decision-making.

Citation Format: Qingtian Li, Albert Lee, Wei Wu, Trever G. Bivona. Integrated machine learning and large language models reveal molecular determinants of survival and treatment response in KRAS-mutant lung cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 5335.
Li et al. studied this question.