This study demonstrates that ML models with similar overall performance can yield substantially divergent predictions at both the individual and subgroup levels, and that no single algorithm consistently outperforms others across all patient subgroups. These findings highlight the limitations of relying solely on global performance metrics and underscore the need for context-aware evaluation of ML models in heterogeneous clinical populations.
Magalhães et al. studied this question.
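As a rough illustration of the finding above, the following sketch shows how two classifiers can share the same global accuracy while differing by subgroup and disagreeing on individual cases. Everything in it is an assumption made for illustration: the toy labels, the subgroup names `A` and `B`, the two prediction vectors, and the helper functions `accuracy`, `subgroup_accuracy`, and `disagreement_rate` are hypothetical and are not taken from the study's data or methods.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately within each subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    return {g: accuracy(ts, ps) for g, (ts, ps) in by_group.items()}


def disagreement_rate(pred_a, pred_b):
    """Fraction of individuals on whom the two models predict differently."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Hypothetical toy data: 8 patients split evenly across two subgroups.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Hypothetical predictions from two models with equal overall accuracy.
pred_a = [1, 0, 1, 0, 0, 1, 1, 0]  # errors fall in subgroup B
pred_b = [0, 1, 1, 0, 1, 0, 1, 0]  # errors fall in subgroup A

print("overall:", accuracy(y_true, pred_a), accuracy(y_true, pred_b))
print("model a by subgroup:", subgroup_accuracy(y_true, pred_a, groups))
print("model b by subgroup:", subgroup_accuracy(y_true, pred_b, groups))
print("individual disagreement:", disagreement_rate(pred_a, pred_b))
```

In this toy case both models score 0.75 overall, yet each is perfect on one subgroup and at chance on the other, and they disagree on half of the individuals. A global metric alone would not distinguish them.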