A high-fidelity, privacy-preserving synthetic dataset of diverse patient personas with longitudinal health trajectories was developed to evaluate AI models for early breast cancer detection.
A synthetic dataset with longitudinal health trajectories and modifiable risk factors provides a privacy-preserving testbed for training AI models in early breast cancer detection.
Abstract

Developing and validating AI/ML models for early cancer detection is significantly hampered by the scarcity, sensitivity, and ethical complexities of real patient data. Synthetic data offers a compelling solution, providing a controlled environment for rigorous model development and testing without compromising privacy. Furthermore, modifiable risk factors (MRFs) such as diet, physical activity, alcohol consumption, sleep patterns, and smoking are critical determinants of cancer risk, including breast cancer risk. The ability to simulate the dynamic interplay of these factors and their impact on health outcomes is crucial for designing effective personalized prevention strategies.

This study aims to establish a robust, privacy-preserving framework for simulating personalized patient journeys to advance early cancer detection, with a particular focus on breast cancer. Our core objectives are (1) to develop a rich synthetic dataset of patient personas augmented with detailed MRFs and longitudinal health parameters, and (2) to enable "what-if" scenario analysis, allowing simulation of the impact of behavioral changes and clinical interventions on individual cancer risk trajectories.

The resource is a synthetic cohort dataset generated with artificial intelligence, designed to serve as a foundation for building persona simulation engines. We initiate persona generation by adapting NVIDIA's Nemotron-Personas dataset, leveraging its inherent demographic diversity (e.g., age, gender, occupation, geographic distribution) as a robust base. These generic personas are then augmented using Synthea, an open-source synthetic electronic health record (EHR) generator. We specifically use Synthea to model realistic, longitudinal patient histories, incorporating a dedicated breast cancer module to simulate disease progression and relevant clinical events over time.
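As an illustration only, the two objectives above (personas augmented with MRFs and longitudinal events, plus "what-if" variants) can be pictured with a minimal Python sketch. Every class, field, and function name here is a hypothetical placeholder; none of it is taken from Nemotron-Personas, Synthea, or the authors' dataset schema, and no risk model is implied.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ModifiableRiskFactors:
    # Hypothetical MRF fields chosen for illustration; the real schema may differ.
    smoker: bool
    alcohol_drinks_per_week: float
    activity_minutes_per_week: int
    sleep_hours_per_night: float


@dataclass(frozen=True)
class Persona:
    # Demographic base (in the spirit of a generic persona record).
    persona_id: str
    age: int
    occupation: str
    # Lifestyle augmentation.
    mrfs: ModifiableRiskFactors
    # (year, description) pairs standing in for EHR-style longitudinal events.
    events: Tuple[Tuple[int, str], ...] = ()


def what_if(persona: Persona, **mrf_changes) -> Persona:
    """Return a counterfactual copy with selected MRFs changed.

    The original persona is left untouched, so baseline and
    counterfactual can be compared side by side.
    """
    return replace(persona, mrfs=replace(persona.mrfs, **mrf_changes))


# Assumed example values, not drawn from any real or published data.
baseline = Persona(
    persona_id="p-001",
    age=52,
    occupation="teacher",
    mrfs=ModifiableRiskFactors(True, 7.0, 60, 6.0),
    events=((2023, "screening mammogram"),),
)

# "What if this patient stops smoking today?"
counterfactual = what_if(baseline, smoker=False)
```

The sketch stops at producing the counterfactual record; mapping either record to a cancer risk trajectory would require a separate model, which is outside what this example assumes.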
Each persona can be customized with a comprehensive set of parameters, including nuanced dietary patterns, specific lifestyle behaviors (e.g., physical activity levels, sleep patterns), social factors, and simulated environmental risks, with a strong emphasis on quantifiable MRFs. These rich data will eventually make it possible to simulate dynamic patient journeys, allowing researchers to explore complex “what-if” questions by adjusting MRFs and observing their effects on cancer risk progression. For example, we could model scenarios such as “What if this patient had annual mammogram screenings for the past three years?” or “What if that patient stops smoking today?”

The result is a high-fidelity, privacy-preserving synthetic dataset of diverse patient personas with rich, longitudinal health trajectories relevant to breast cancer risk. This dataset serves as a useful testbed for developing and evaluating AI/ML models for early cancer detection. Future work will add “what-if” simulation capabilities that are expected to provide valuable insights into the personalized impact of MRF modifications and early interventions, laying the groundwork for precision prevention strategies in oncology.

Citation Format: N. H. Borges, L. E. Silva e Oliveira, I. C. Cazagranda, C. B. de Albuquerque, O. Marques, C. d. Costa. Simulating Personalized Patient Journeys for Early Cancer Detection Using Artificial Intelligence Synthetic Data [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-06-25.