Abstract Introduction: Previous studies have suggested that smoking intensity may differ across ancestral populations. Limited availability of real-world data often constrains these types of analyses; however, artificial intelligence has accelerated our ability to annotate clinical data at scale. This work uses large language models (LLMs) to extract smoking behavior-related attributes and examine the interplay between smoking exposure and genetic ancestry in a large clinico-genomic cohort spanning 20 cancer types. Methods: Clinical attributes such as age, sex, and stage were extracted from electronic health records (EHR). Yost Index/socioeconomic status (SES) was assigned based on patient census tract. Genetic ancestry was determined from MSK-IMPACT using reference populations from the 1000 Genomes Project (1000GP), including European (EUR), African (AFR), and East Asian (EAS), and was assigned based on an ancestral proportion threshold of 80%. We imputed genome-wide germline variants from on- and off-target reads. Tumor mutational burden (TMB) was calculated as the total number of nonsynonymous mutations per genome sequenced per panel; TMB percentile was generated per cancer type and used for the analysis. We evaluated the ability of regex, Llama3 (8b/70b/405b), and GPT4o to curate detailed smoking data (including pack years (PYs) and smoking duration (SD)) from medical oncology notes. SD and PYs were split into four groups based on smoking severity: 0, 0-20, 20-40, and >40. We controlled for age, sex, stage, histology, and SES in our multivariate models. Ancestral groups were compared using Wilcoxon or Kruskal-Wallis tests. Results: LLMs outperformed regex for extracting smoking data, with GPT4o performing best overall for pack-year curation (0.85 accuracy). Our cohort consisted of 31,916 patients treated at Memorial Sloan Kettering Cancer Center (12K smokers and 19K never smokers). 
In a multivariate model, AFR (NSCLC, p<0.001) and EAS ancestry (NSCLC, p<0.001) had a negative association with PYs and SD, unlike EUR patients (NSCLC, p<0.001; Bladder, p<0.001). Increases in PYs and SD were positively associated with mutational signature SBS4 (p<0.001). TMB has previously been shown to have a dose-dependent relationship with smoking exposure; these findings were replicated in our cohort (PYs: p<0.001; SD: p<0.001). Light smoking resulted in a more significant increase in TMB in AFR compared to EUR patients (PYs: p<0.001; SD: p<0.01). We assessed the interaction between continuous AFR ancestry and nicotinic acetylcholine receptor (nAChR) germline variants as a potential mechanism for increased TMB, and identified novel alleles that have a joint effect on TMB (rs117129712; Coef: 19.89, p=0.036). Conclusion: EUR, EAS, and AFR ancestry populations have distinct smoking phenotypes, which supports prior epidemiological evidence of differences in smoking behavior. Further evaluation is needed to understand how differences in TMB between ancestry groups for light smokers may impact patient outcomes. Citation Format: Tejiri Agbamu, Michele Waters, Nicholas Pickersgill, Tomin Perea-Chamblee, Xinran Bi, Xuechun Bai, Jian Carrot-Zhang, Christopher Fong, Justin Jee, Nikolaus Schultz. Characterizing interactions between genomic ancestry and social determinants of health and their implications for patient outcomes by leveraging LLM-annotated smoking data in a large clinicogenomic cohort [abstract]. In: Proceedings of the 18th AACR Conference on the Science of Cancer Health Disparities; 2025 Sep 18-21; Baltimore, MD. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2025;34(9 Suppl):Abstract nr B159.
Agbamu et al. (2025) studied this question.
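
The grouping steps described in the Methods (four smoking-severity groups, ancestry assignment at an 80% proportion, and TMB percentile per cancer type) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the inclusive upper bin edges (for example, exactly 20 pack years falling in the 0-20 group), the inclusive 80% comparison, and the percentile definition (share of same-type patients with TMB at or below the patient's value) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict

# 80% comes from the abstract; treating it as an inclusive cutoff is an assumption.
ANCESTRY_THRESHOLD = 0.80


def exposure_group(value):
    """Map pack years or smoking duration to one of four severity groups.

    Assumed bin edges: 0, (0, 20], (20, 40], and above 40.
    """
    if value < 0:
        raise ValueError("exposure cannot be negative")
    if value == 0:
        return "0"
    if value <= 20:
        return "0-20"
    if value <= 40:
        return "20-40"
    return ">40"


def assign_ancestry(proportions, threshold=ANCESTRY_THRESHOLD):
    """Return the ancestry label whose proportion meets the threshold, else None.

    proportions: dict such as {"EUR": 0.85, "AFR": 0.10, "EAS": 0.05}.
    """
    label, share = max(proportions.items(), key=lambda kv: kv[1])
    return label if share >= threshold else None


def tmb_percentiles(records):
    """Percentile rank of each TMB value within its own cancer type.

    records: list of (patient_id, cancer_type, tmb) tuples (hypothetical layout).
    """
    by_type = defaultdict(list)
    for _, cancer_type, tmb in records:
        by_type[cancer_type].append(tmb)

    result = {}
    for patient_id, cancer_type, tmb in records:
        values = by_type[cancer_type]
        at_or_below = sum(v <= tmb for v in values)
        result[patient_id] = 100.0 * at_or_below / len(values)
    return result
```

Ranking TMB within each cancer type, rather than across the whole cohort, keeps tumor types with very different baseline mutation rates comparable, which matches the abstract's statement that the percentile was generated per cancer type.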