Background: Long non-coding RNAs (lncRNAs) have emerged as important regulators in cancer biology, yet their potential for cancer subtyping remains underexplored, particularly in large-scale, multi-class supervised classification frameworks. Prior work has been constrained by limited publicly available data or has used lncRNAs only as auxiliary features in classification tasks.
Methods: In this study, we used an expansive set of 7,177 lncRNAs obtained from 1,021 breast cancer (BRCA) transcriptomics datasets for subtyping within an explainable artificial intelligence (AI) framework. lncRNA, mRNA, and miRNA features were used to build machine learning (ML) models individually and in combination. Four ML classifiers (Naïve Bayes, Random Forest, Artificial Neural Network, and XGBoost) were employed to evaluate subtype classification performance.
Results: Using lncRNAs alone, XGBoost demonstrated strong performance, with an accuracy of 89.2% and an AUROC of 0.99. Adding miRNA or mRNA features to the lncRNA features improved accuracy marginally, to 90.8% and 92.2%, respectively, while combining all three feature types provided no further gain. A sequential key-feature identification pipeline (ANOVA, Boruta, SHAP) identified interpretable subtype-specific biomarker panels, yielding 119, 66, 54, and 24 unique features for the Luminal A, Luminal B, HER2+, and Basal subtypes, respectively. Further lncRNA characterization followed by survival analysis revealed significant subtype-specific novel lncRNAs, including CUFF.25255 (LumA), CUFF.20237 and CUFF.3888 (LumB), CUFF.22414 (HER2+), and CUFF.26607 and CUFF.1961 (Basal).
Conclusion: Our findings highlight the diagnostic and biomarker discovery potential of lncRNAs. The explainable-AI framework implemented here provides a systematic, large-scale evaluation of lncRNA-only and integrative models for multi-class BRCA subtyping and can be adapted to other cancers using existing cancer transcriptomics data in public databases.
Patel et al. (Tue,) studied this question.
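The first stage of the sequential feature-identification pipeline named in the abstract (ANOVA) can be illustrated with a minimal sketch. It computes a one-way ANOVA F-statistic per feature across subtypes and ranks features by it. This is not the authors' implementation: the function names `anova_f` and `rank_features`, the feature names, and the toy expression values are hypothetical, and the Boruta and SHAP stages are not shown.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` holds one list of expression values per subtype.
    """
    k = len(groups)                      # number of subtypes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = fmean(v for g in groups for v in g)
    means = [fmean(g) for g in groups]
    # Variation of subtype means around the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Variation of samples around their own subtype mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(expr, labels):
    """Rank features by F-statistic, most subtype-discriminative first.

    `expr` maps a feature name to its per-sample values;
    `labels` gives the subtype of each sample in the same order.
    """
    scores = {}
    for feature, values in expr.items():
        by_subtype = {}
        for value, label in zip(values, labels):
            by_subtype.setdefault(label, []).append(value)
        scores[feature] = anova_f(list(by_subtype.values()))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): six samples, three subtypes, two features.
labels = ["LumA", "LumA", "LumB", "LumB", "Basal", "Basal"]
expr = {
    "lnc_informative": [1.0, 1.2, 5.0, 5.1, 9.0, 9.2],  # differs by subtype
    "lnc_noise": [3.0, 7.0, 4.0, 6.0, 5.0, 5.0],        # same mean everywhere
}
ranking = rank_features(expr, labels)
```

In a pipeline of the kind the abstract describes, features passing such a filter would then go to a wrapper method (Boruta) and a model-explanation step (SHAP). The ranking here only shows the filtering idea on synthetic values.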