Background: Individual complete blood count (CBC) parameters are non-specific indicators of disease, which limits their diagnostic utility when interpreted in isolation. Objective: To develop and validate a machine learning model combining multiple CBC biomarkers (total leukocyte count [TLC], packed cell volume [PCV], platelet count [PLT], red cell distribution width [RDW], and haemoglobin [HGB]) for early screening of patients with abnormal blood profiles suggestive of autoimmune disorders and/or malignancies. Methods: CBC data from a Kaggle dataset comprising 364 patients were analysed. Binary outlier indicators were derived for each parameter using NIH reference ranges. A logistic regression model was developed on 80% of the data (n = 292) and validated on the remaining 20% (n = 72), using repeated 10-fold cross-validation. Multicollinearity was assessed using the variance inflation factor (VIF), with VIF < 5 considered acceptable. Results: The model achieved an AUC of 0.886, with a 10-fold cross-validation accuracy of 78%, sensitivity of 88.6%, specificity of 50%, and precision of 82.4%. Four predictors were statistically significant: TLC (p < 0.001), PCV (p < 0.001), PLT (p = 0.003), and RDW (p = 0.008); HGB was not (p = 0.142). The model flagged 72.5% of patients as having at least one CBC abnormality warranting clinical follow-up, and 3.02% as having multiple concurrent abnormalities. Gender differences were observed (male: 35.4% positive; female: 17% positive). Conclusion: This proof-of-concept study demonstrates that logistic regression modelling of CBC outliers can identify high-risk patients for further diagnostic workup. However, the low specificity (50%) and the lack of confirmed diagnoses limit clinical applicability. External validation on larger, multi-centre datasets with verified disease outcomes is required before clinical implementation.
Das et al. studied this question.
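The two mechanical steps named in the abstract, deriving binary outlier indicators from reference ranges and summarising screening performance, can be sketched as follows. This is a minimal illustration, not the authors' code: the reference ranges are placeholder values assumed for the example (they are not the NIH ranges used in the study), and the function names `outlier_flags` and `screening_metrics` are hypothetical.

```python
from typing import Dict, List, Tuple

# Placeholder reference ranges, assumed for illustration only.
# They are NOT the NIH ranges used in the study.
REFERENCE_RANGES: Dict[str, Tuple[float, float]] = {
    "TLC": (4.0, 11.0),     # x10^3 cells/uL
    "PCV": (36.0, 50.0),    # %
    "PLT": (150.0, 450.0),  # x10^3 cells/uL
    "RDW": (11.5, 14.5),    # %
    "HGB": (12.0, 17.5),    # g/dL
}


def outlier_flags(cbc: Dict[str, float]) -> Dict[str, int]:
    """Return 1 for each parameter outside its reference range, else 0."""
    return {
        name: int(not (low <= cbc[name] <= high))
        for name, (low, high) in REFERENCE_RANGES.items()
    }


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute sensitivity, specificity, precision and accuracy from labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positives among actual positives
        "specificity": _ratio(tn, tn + fp),  # true negatives among actual negatives
        "precision": _ratio(tp, tp + fp),    # true positives among predicted positives
        "accuracy": _ratio(tp + tn, len(pairs)),
    }


# Hypothetical patient: TLC and PLT fall outside the placeholder ranges.
patient = {"TLC": 13.2, "PCV": 41.0, "PLT": 520.0, "RDW": 13.0, "HGB": 14.1}
flags = outlier_flags(patient)
n_abnormal = sum(flags.values())

# Toy labels (1 = abnormal profile), made up for the example.
metrics = screening_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```

In the toy labels, half of the true negatives are predicted positive, so specificity is 0.5 even though sensitivity and precision are both 0.75. This is the same pattern the abstract reports: a screen that misses few abnormal profiles but sends many normal ones to follow-up.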