ABSTRACT Deep learning models for Kellgren–Lawrence (KL) grading often report optimistic performance because of data leakage, and they fail to generalize across institutions because of domain shift. To address this reproducibility problem, we introduce KL‐FuseNet, a multitask architecture that fuses global (ConvNeXt‐Base) and local (ResNet‐50) features to predict ordinal grades, label distributions, and binary severity (KL≥2). Using strict patient‐wise stratified splits on an internal Osteoarthritis Initiative dataset (n = 8260) and an independent Chinese cohort (n = 2295), we compared zero‐shot transfer against selective fine‐tuning. KL‐FuseNet achieved robust internal agreement (quadratic weighted kappa, QWK: 0.881; accuracy: 70.3%). External zero‐shot deployment revealed a domain gap, with accuracy dropping to 66.1%. Our selective fine‐tuning protocol substantially narrowed this gap, raising external accuracy to 80.0% and QWK to 0.950, with an AUC of 0.984 for clinically significant osteoarthritis (KL≥2). These results demonstrate that although KL‐FuseNet achieves state‐of‐the‐art performance under rigorous evaluation, domain‐aware adaptation is essential for clinical utility. This study establishes a reproducible pathway for deploying automated grading models across heterogeneous medical centers.
Alkhatatbeh et al. (Sun,) studied this question.
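The agreement metric reported in the abstract, quadratic weighted kappa (QWK), can be illustrated with a minimal sketch. It assumes KL grades are encoded as integers 0 to 4 and uses the standard QWK definition; the function name `quadratic_weighted_kappa` and its parameters are illustrative and are not taken from the study's code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer grades in [0, n_classes - 1].

    Assumption: KL grades are encoded 0..4, hence n_classes defaults to 5.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty: larger grade disagreements cost more.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement (independent marginals).
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Happens when every label and every prediction is the same grade.
        raise ValueError("kappa is undefined when all grades are identical")

    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical grades for illustration only, not data from the study.
    truth = [0, 1, 2, 3, 4, 2, 3]
    preds = [0, 1, 2, 3, 4, 3, 2]
    print(round(quadratic_weighted_kappa(truth, preds), 3))
```

Because the penalty grows with the square of the grade difference, confusing KL 0 with KL 4 is penalized far more heavily than confusing adjacent grades. This is why QWK can be high even when exact-match accuracy is moderate.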