Introduction: Applying machine learning in healthcare requires models that show not only acceptable classification performance but also trustworthy learning behavior suitable for clinical deployment. Class imbalance is a pervasive challenge in medical datasets, where patients with favorable outcomes substantially outnumber those with adverse events.
Materials and methods: This study compared two ensemble learning approaches for five-year survival prediction in eye cancer: CatBoost, a gradient boosting algorithm used here with balanced class weights, and RUSBoost, an algorithm that integrates random undersampling directly into the boosting framework. Model evaluation extended beyond aggregate performance metrics to a systematic assessment of learning dynamics throughout training.
Results: Both classifiers achieved comparable discriminative ability on held-out test data, with area under the receiver operating characteristic curve values of approximately 0.78. Confusion matrix analysis showed that both models achieved acceptable classification rates, with the expected gradual decreases from the training to the validation to the test partitions. Learning curves, however, revealed a critical distinction: the RUSBoost classifier exhibited healthy learning dynamics, with parallel training and validation curves separated by a stable, narrow gap, whereas the CatBoost classifier displayed a progressively widening divergence between training and validation performance, indicative of overfitting, that necessitated early stopping. A practitioner examining only confusion matrices and aggregate metrics might reasonably but incorrectly favor CatBoost based on its marginal advantage in classification consistency.
Conclusions: These findings demonstrate that model selection in medical artificial intelligence must take transparent learning dynamics into account and cannot rest on aggregate performance metrics alone, because models that reach favorable summary statistics through problematic learning pathways cannot be considered trustworthy for clinical applications in which patient outcomes depend on prediction reliability. This study establishes evaluation criteria to ensure that, when machine learning-based decision support is considered appropriate for a given clinical context, the selected model exhibits learning behavior consistent with genuine predictive capability.
Shin et al. (Thu,) studied this question.
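The contrast drawn above between a stable, narrow train-validation gap and a progressively widening one can be expressed numerically. The following is a minimal sketch, not the study's actual procedure: the function names `gap_trend` and `is_widening`, the tolerance `slope_tol`, and the example curves are all hypothetical and chosen only for illustration. It fits an ordinary least-squares line to the per-iteration gap and flags a curve pair as widening when the slope exceeds the tolerance.

```python
from statistics import mean


def gap_trend(train_scores, valid_scores):
    """Return (mean gap, slope of the gap per iteration) for two learning curves.

    Both inputs are per-iteration scores where higher is better (e.g. AUC).
    """
    if len(train_scores) != len(valid_scores) or len(train_scores) < 2:
        raise ValueError("curves must have equal length of at least 2")
    # Gap at each iteration: how far training performance exceeds validation.
    gaps = [t - v for t, v in zip(train_scores, valid_scores)]
    xs = list(range(len(gaps)))
    x_bar, g_bar = mean(xs), mean(gaps)
    # Ordinary least-squares slope of the gap against the iteration index.
    num = sum((x - x_bar) * (g - g_bar) for x, g in zip(xs, gaps))
    den = sum((x - x_bar) ** 2 for x in xs)
    return g_bar, num / den


def is_widening(train_scores, valid_scores, slope_tol=0.001):
    """Flag a progressively widening gap. slope_tol is an assumed threshold."""
    _, slope = gap_trend(train_scores, valid_scores)
    return slope > slope_tol


# Made-up curves for illustration only; these are not results from the study.
stable_train = [0.70, 0.74, 0.77, 0.79]
stable_valid = [0.68, 0.72, 0.75, 0.77]
diverging_train = [0.70, 0.80, 0.90, 0.98]
diverging_valid = [0.68, 0.72, 0.74, 0.75]

print(is_widening(stable_train, stable_valid))
print(is_widening(diverging_train, diverging_valid))
```

A slope-based check is only one possible criterion. An appropriate tolerance would depend on the metric, the number of boosting iterations, and the noise in the validation scores, so any threshold should be justified for the dataset at hand.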