What question did this study set out to answer?

The aim is to develop a reliable deep learning model for Kellgren–Lawrence grading that can generalize across different medical institutions.

March 16, 2026 | Open Access

Cross‐Institutional Five‐Class Kellgren–Lawrence Grading of Knee Osteoarthritis via Multitask Deep Learning

Key Points

The aim was to develop a reliable deep learning model for Kellgren–Lawrence (KL) grading that generalizes across medical institutions.
The proposed model, KL-FuseNet, fuses global (ConvNeXt-Base) and local (ResNet-50) features.
Strict patient-wise stratified splits were applied to both the internal dataset and the independent external dataset.
Zero-shot transfer was compared with selective fine-tuning on the external dataset.
Internal agreement reached a quadratic weighted Cohen's kappa (QWK) of 0.881, with an accuracy of 70.3%.
Accuracy dropped to 66.1% under external zero-shot deployment.
Selective fine-tuning raised external accuracy to 80.0% and QWK to 0.950, with an AUC of 0.984 for clinically significant osteoarthritis (KL≥2).
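
The agreement figures above are reported as quadratic weighted kappa. As a rough illustration of what that metric measures, the sketch below computes it in plain Python for ordinal grades 0 to 4. The function name and interface are assumptions made for this example, not the study's evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_classes-1.

    Illustrative sketch only; not the evaluation code used in the study.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(observed[i]) for i in range(num_classes)]
    col_totals = [sum(observed[i][j] for i in range(num_classes))
                  for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Penalty grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if raters were independent given their marginals.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # All labels and predictions fall in one class; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

Because the weights are quadratic, confusing KL 1 with KL 2 costs far less than confusing KL 0 with KL 4, which is why QWK can be high even when exact five-class accuracy is moderate.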

Abstract

Deep learning models for Kellgren–Lawrence (KL) grading often report optimistic performance due to data leakage and fail to generalize across institutions because of domain shift. To address this reproducibility crisis, we introduce KL‐FuseNet, a multitask architecture fusing global (ConvNeXt‐Base) and local (ResNet‐50) features to predict ordinal grades, label distributions, and binary severity (KL≥2). Using strict patient‐wise stratified splits on an internal Osteoarthritis Initiative dataset (n = 8260) and an independent Chinese cohort (n = 2295), we compared zero‐shot transfer against selective fine‐tuning. KL‐FuseNet achieved robust internal agreement (quadratic weighted Cohen's kappa, QWK: 0.881; accuracy: 70.3%). External zero‐shot deployment revealed a domain gap, with accuracy dropping to 66.1%, but our selective fine‐tuning protocol substantially narrowed this gap, raising external accuracy to 80.0% and QWK to 0.950, with an AUC of 0.984 for clinically significant osteoarthritis (KL≥2). These results demonstrate that while KL‐FuseNet achieves state‐of‐the‐art performance under rigorous evaluation, domain‐aware adaptation is essential for clinical utility. This study establishes a reproducible pathway for deploying automated grading models across heterogeneous medical centers.
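
The abstract attributes optimistic results to data leakage and counters it with patient-wise stratified splits. A minimal sketch of that idea follows, assuming each record is a hypothetical `(patient_id, kl_grade)` pair and that patients are stratified by their highest grade. The split fractions, the seed, and the stratification rule are all assumptions for illustration; the study's actual protocol may differ.

```python
import random
from collections import defaultdict


def patient_wise_stratified_split(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (patient_id, kl_grade) records into train/val/test by patient.

    Every image of a given patient lands in exactly one split, which avoids
    leakage between a patient's left and right knees or repeat visits.
    Illustrative sketch only; fractions and stratification rule are assumed.
    """
    grades_by_patient = defaultdict(list)
    for patient_id, grade in records:
        grades_by_patient[patient_id].append(grade)

    # Assumption: stratify each patient by the worst grade among their images.
    strata = defaultdict(list)
    for patient_id, grades in grades_by_patient.items():
        strata[max(grades)].append(patient_id)

    rng = random.Random(seed)
    assignment = {}
    for grade in sorted(strata):
        patient_ids = sorted(strata[grade])  # sort first for reproducibility
        rng.shuffle(patient_ids)
        n_train = round(fractions[0] * len(patient_ids))
        n_val = round(fractions[1] * len(patient_ids))
        for k, patient_id in enumerate(patient_ids):
            if k < n_train:
                assignment[patient_id] = 0
            elif k < n_train + n_val:
                assignment[patient_id] = 1
            else:
                assignment[patient_id] = 2

    train, val, test = [], [], []
    for patient_id, grade in records:
        (train, val, test)[assignment[patient_id]].append((patient_id, grade))
    return train, val, test
```

Splitting at the image level instead would let two knees from the same patient appear in both training and test sets, which is one common source of the inflated scores the abstract describes.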
