What question did this study set out to answer?

The aim was to develop a machine learning model for predicting overall survival in prostate cancer patients using data from a Brazilian cohort.

May 30, 2026

Translating real-world data into actionable insights: An AI-powered model for prostate cancer survival prediction in a Brazilian cohort.

Key Points

The aim was to develop a machine learning model for predicting overall survival in prostate cancer patients using data from a Brazilian cohort.
Conducted a retrospective cohort study with 10,556 prostate cancer patients from a statewide cancer registry.
Used the Boruta algorithm for feature selection, resulting in 15 key predictors.
Trained multiple machine learning models and selected LightGBM based on performance metrics.
The LightGBM model achieved 71.3% accuracy and a weighted F1-score of 0.72 on the test set.
For mortality prediction, precision was 0.62 and recall was 0.71 (F1-score = 0.66).
For survival, the model showed precision of 0.79 and recall of 0.72 (F1-score = 0.75).
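
The per-class F1-scores above follow from the reported precision and recall, since F1 is their harmonic mean. A minimal sketch of that arithmetic is shown here; the function name `f1_score` and the dictionary layout are illustrative choices, not part of the study's code, and only the numeric values come from the summary.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 as reported for the test set.
reported = {
    "mortality": {"precision": 0.62, "recall": 0.71, "f1": 0.66},
    "survival": {"precision": 0.79, "recall": 0.72, "f1": 0.75},
}

# Recompute F1 for each class and round to two decimals, as in the report.
recomputed = {
    label: round(f1_score(m["precision"], m["recall"]), 2)
    for label, m in reported.items()
}

for label, value in recomputed.items():
    print(f"{label}: recomputed F1 = {value:.2f}, reported F1 = {reported[label]['f1']:.2f}")
```

The recomputed values can be compared directly with the reported ones as a quick consistency check on rounded metrics.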

Abstract

e23449 Background: Prostate cancer (PCa) outcomes are shaped by a complex interplay of clinical and socioeconomic factors. In resource-limited settings, routinely collected cancer registry data represent a critical but underutilized source for prognostic modeling. However, data heterogeneity and missing key clinical variables often limit their direct clinical applicability; consequently, translating real-world registry data into reliable survival predictions remains a significant clinical gap. We aimed to develop and externally evaluate a machine learning model to predict overall survival in PCa patients using data from a Brazilian Hospital-based Cancer Registry (HbCR), thereby demonstrating a practical application of data science translated into clinical oncology.

Methods: We conducted a retrospective cohort study including 10,556 PCa patients recorded in a statewide HbCR from Southeastern Brazil, probabilistically linked to the Mortality Information System. The dataset was randomly divided into training (n = 8,418) and testing (n = 2,110) sets. Feature selection was performed using the Boruta algorithm, yielding 15 key predictors encompassing demographic, clinical, and treatment-related variables. Multiple machine learning models were trained and compared, with the LightGBM algorithm ultimately selected based on discriminative performance. The primary outcome was all-cause mortality; accordingly, the model was trained to predict this outcome.

Results: In the independent test set, the final LightGBM model achieved an overall accuracy of 71.3% and a weighted F1-score of 0.72. For mortality prediction, the model yielded a precision of 0.62 and a recall of 0.71 (F1-score = 0.66). For survival, precision was 0.79 and recall was 0.72 (F1-score = 0.75). The most influential predictors of survival identified by the model included age, clinical and pathological TNM stage, initial treatment modality, disease status at the end of treatment, race, and socioeconomic proxies such as treatment cost. Model performance remained robust despite the absence of prostate-specific antigen levels and Gleason scores, demonstrating effective learning from heterogeneous registry data.

Conclusions: Machine learning applied to routinely collected HbCR data can generate clinically meaningful survival predictions for PCa even in the presence of incomplete clinical information. This population-based LightGBM model offers a scalable, low-cost strategy to identify high-risk patients and can support clinical decision-making and resource allocation in public health systems. By transforming administrative HbCR data into actionable survival insights, this approach exemplifies the 'science and practice of translation' and highlights the translational potential of real-world data analytics to improve PCa outcomes in LMICs.
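
The random division into training and testing sets described in the Methods can be sketched as follows. This is an illustration only: the abstract does not state the seed, whether the split was stratified by outcome, or which tooling was used, so the helper `random_split`, the seed value, and the integer patient identifiers are all hypothetical. Only the two set sizes come from the abstract.

```python
import random

# Set sizes as reported in the Methods.
TRAIN_SIZE = 8_418
TEST_SIZE = 2_110


def random_split(ids, train_size, seed=42):
    """Shuffle ids reproducibly and cut them into train and test lists.

    The seed is an arbitrary assumption; the abstract does not report one.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical identifiers standing in for registry records.
patient_ids = range(TRAIN_SIZE + TEST_SIZE)

train_ids, test_ids = random_split(patient_ids, TRAIN_SIZE)
train_fraction = len(train_ids) / (len(train_ids) + len(test_ids))

print(f"train: {len(train_ids)}, test: {len(test_ids)}, train fraction: {train_fraction:.3f}")
```

With the reported sizes, the training fraction works out to roughly 80% of the split records.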
