Abstract

Background: Personalized cancer vaccines hold great promise by eliciting tumor-specific immune responses [1-3]. A key challenge is identifying the right targets: immunogenic protein sequences, or epitopes, presented on tumor cells. While computational pipelines can predict epitope candidates from tumor sequencing, experimental validation is costly and slow. Leveraging literature and database knowledge could bridge this gap by enabling evidence-driven selection of high-confidence targets, but this approach is constrained by information that is fragmented across journals and immunology databases [4-5]. We introduce EpitopeMiner, which integrates sequence-based candidate screening with evidence-driven knowledge retrieval for epitope prioritization.

Methods: A total of 25,966 tumor-specific epitopes were predicted from whole-genome sequencing of tumor-PBMC pairs from nine patients (including lung cancer, sarcoma, NKTL, and DLBCL) using a standard workflow: HLA typing (OptiType), variant calling (Strelka with wANNOVAR), MHC binding prediction (NetMHCpan), and RNA-supported protein-altering filtering. EpitopeMiner combines an OpenAI large language model (LLM) with an in-house retrieval-augmented generation (RAG) database comprising (a) 78,461 full-text research articles from PMC, PLOS One, and Europe PMC, and (b) ~2.6 million unique epitopes from the IEDB, dbPepNeo, SystemMHC, TANTIGEN, and caAtlas databases. EpitopeMiner includes (i) a screening module that processes an epitope list, detecting exact matches or partial matches of ≥7 amino acids against the in-house database, and (ii) a reporting module that analyzes each top-ranked hit, defined by highest sequence similarity and evidence density, to generate an LLM response covering 28 immunology keywords with citations.

Results: Among the 25,966 epitopes predicted from the nine patients, EpitopeMiner found 7 exact matches, and 17.6% had ≥6 partial matches; mean processing time per epitope was 0.97 seconds.
In benchmarking with three lung cancer driver-gene epitopes (KITDFGRAK, ITDFGRAKL, TDFGRAKLL), EpitopeMiner outperformed ChatGPT and Gemini, returning the most relevant immunological information. Summarized as (total responses, % with evidence), EpitopeMiner returned (18, 100%), (8, 100%), and (2, 100%) for the three epitopes respectively, compared with ChatGPT’s (10, 70%), (4, 50%), and (6, 33%) and Gemini’s (7, 0%), (1, 0%), and (1, 0%). In addition, EpitopeMiner retrieved ≥10 partial matches for each epitope, whereas ChatGPT retrieved a total of 3 and Gemini none.

Conclusion: We built EpitopeMiner, a computational framework for sustainable literature and database curation. In a nine-patient dataset, EpitopeMiner retrieved experimentally and clinically validated epitope evidence at a scale and speed infeasible with manual analysis. EpitopeMiner outperformed general-purpose LLMs by returning cited responses, achieving 100% evidence coverage on the benchmark epitopes, reducing hallucinations and improving reliability.

Citation Format: Agamjyot Singh Chadha, Isaac Jiasheng Cheong, Marcia Zhang, Wei Kit Tan, Wei Lin Tang, Jing Quan Lim, Solomonraj Wilson, Choon Kiat Ong, Bernett Lee, Chwee Ming Lim, Olaf Rotzschke, Mai Chan Lau. EpitopeMiner: Scalable knowledge mining for evidence-driven personalized cancer vaccine design [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 6698.
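
The screening step described in the Methods (exact matches, or partial matches of at least 7 amino acids, against a reference epitope collection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that a "partial match" means two peptides share a contiguous stretch of at least 7 residues, which the abstract does not define precisely. The names `screen`, `build_index`, `kmers`, and `ScreenResult` are hypothetical, and the toy reference list simply reuses two of the benchmark peptides quoted in the Results rather than real database content.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


def kmers(peptide: str, k: int) -> Set[str]:
    """Return every contiguous substring of length k (empty if peptide is shorter)."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


@dataclass
class ScreenResult:
    exact: bool                                        # candidate is itself in the reference set
    partial: List[str] = field(default_factory=list)   # other reference peptides sharing >= k residues


def build_index(database: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the reference peptides that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for peptide in database:
        for kmer in kmers(peptide, k):
            index[kmer].add(peptide)
    return index


def screen(candidates: Iterable[str], database: Iterable[str], k: int = 7) -> Dict[str, ScreenResult]:
    """Flag exact matches and shared-substring (>= k residues) partial matches.

    Two peptides share a contiguous stretch of length >= k exactly when they
    share at least one k-mer, so a k-mer index is sufficient here.
    """
    known = set(database)
    index = build_index(known, k)
    results: Dict[str, ScreenResult] = {}
    for candidate in candidates:
        hits: Set[str] = set()
        for kmer in kmers(candidate, k):
            hits |= index.get(kmer, set())
        hits.discard(candidate)  # an exact match is reported separately
        results[candidate] = ScreenResult(exact=candidate in known, partial=sorted(hits))
    return results


# Toy reference list (illustrative only): two of the benchmark peptides.
toy_reference = ["ITDFGRAKL", "TDFGRAKLL"]
for peptide, result in screen(["KITDFGRAK", "ITDFGRAKL"], toy_reference).items():
    print(peptide, result.exact, result.partial)
```

Indexing the reference set by 7-mers keeps each lookup proportional to the candidate's length rather than to the size of the reference collection, which is one plausible way to reach sub-second screening per epitope; the abstract does not say which approach EpitopeMiner actually uses.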