e23449 Background: Prostate cancer (PCa) outcomes are shaped by a complex interplay of clinical and socioeconomic factors. In resource-limited settings, routinely collected cancer registry data represent a critical but underutilized source for prognostic modeling. However, data heterogeneity and missing key clinical variables often limit their direct clinical applicability; consequently, translating real-world registry data into reliable survival predictions remains a significant clinical gap. We aimed to develop and externally evaluate a machine learning model to predict overall survival in PCa patients using data from a Brazilian Hospital-based Cancer Registry (HbCR), thereby demonstrating a practical application of data science translated into clinical oncology. Methods: We conducted a retrospective cohort study including 10,556 PCa patients recorded in a statewide HbCR from Southeastern Brazil, probabilistically linked to the Mortality Information System. The dataset was randomly divided into training (n = 8,418) and testing (n = 2,110) sets. Feature selection was performed using the Boruta algorithm, yielding 15 key predictors encompassing demographic, clinical, and treatment-related variables. Multiple machine learning models were trained and compared, with the LightGBM algorithm ultimately selected based on discriminative performance. The primary outcome was all-cause mortality; accordingly, the model was trained to predict this outcome. Results: In the independent test set, the final LightGBM model achieved an overall accuracy of 71.3% and a weighted F1-score of 0.72. For mortality prediction, the model yielded a precision of 0.62 and a recall of 0.71 (F1-score = 0.66). For survival, precision was 0.79 and recall was 0.72 (F1-score = 0.75).
The most influential predictors of survival identified by the model included age, clinical and pathological TNM stage, initial treatment modality, disease status at the end of treatment, race, and socioeconomic proxies such as treatment cost. Model performance remained robust despite the absence of prostate-specific antigen levels and Gleason scores, demonstrating effective learning from heterogeneous registry data. Conclusions: Machine learning applied to routinely collected HbCR data can generate clinically meaningful survival predictions for PCa even in the presence of incomplete clinical information. This population-based LightGBM model offers a scalable, low-cost strategy to identify high-risk patients and can support clinical decision-making and resource allocation in public health systems. By transforming administrative HbCR data into actionable survival insights, this approach exemplifies the ‘science and practice of translation’ and highlights the translational potential of real-world data analytics to improve PCa outcomes in LMICs.
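As a reading aid, the per-class F1-scores in the Results can be cross-checked against the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the abstract does not state the formula or the tooling used, and the helper name `f1_from_precision_recall` is illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class values copied from the abstract's Results section.
reported = {
    "mortality": {"precision": 0.62, "recall": 0.71, "f1": 0.66},
    "survival": {"precision": 0.79, "recall": 0.72, "f1": 0.75},
}

# Recompute F1 for each class and round to the two decimals used in the abstract.
recomputed = {
    label: round(f1_from_precision_recall(m["precision"], m["recall"]), 2)
    for label, m in reported.items()
}

for label, m in reported.items():
    status = "consistent" if recomputed[label] == m["f1"] else "differs"
    print(f"{label}: reported F1 = {m['f1']}, recomputed F1 = {recomputed[label]} ({status})")
```

Under this assumption both recomputed values match the reported F1-scores to two decimals. The weighted F1-score of 0.72 cannot be reproduced the same way, because the number of deaths and survivors in the test set is not given.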
Filho et al. (Thu,) studied this question.