What question did this study set out to answer?

The study aims to explore and classify long non-coding RNAs (lncRNAs) for breast cancer subtyping using an explainable AI framework.

March 13, 2026 | Open Access

An explainable-AI framework reveals novel lncRNAs specific for breast cancer subtypes

Key Points

Utilized 7,177 lncRNAs from 1,021 breast cancer transcriptomics datasets.
Built machine learning models using lncRNA, mRNA, and miRNA features.
Employed four machine learning classifiers: Naïve Bayes, Random Forest, Artificial Neural Network, and XGBoost.
Conducted sequential key feature identification using ANOVA, Boruta, and SHAP.
XGBoost achieved an accuracy of 89.2% for lncRNA-only classification.
Adding miRNA features to the lncRNA features improved accuracy to 90.8%; adding mRNA features instead improved it to 92.2%.
Identified 119, 66, 54, and 24 unique features for Luminal A, Luminal B, HER2+, and Basal subtypes, respectively.
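Survival analysis revealed significant novel subtype-specific lncRNAs, including CUFF.25255 (LumA), CUFF.20237 and CUFF.3888 (LumB), CUFF.22414 (HER2+), and CUFF.26607 and CUFF.1961 (Basal).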
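
The first stage of the sequential feature identification pipeline named above is an ANOVA filter. The summary does not describe how the authors implemented it, so the following is only a minimal sketch of a one-way ANOVA F-statistic used to rank features across subtypes. The function names, feature names, and toy expression values are hypothetical and chosen purely for illustration; the Boruta and SHAP stages are not shown.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` holds one list of expression values per subtype.
    Assumes at least two groups and more samples than groups.
    """
    k = len(groups)                   # number of subtypes
    n = sum(len(g) for g in groups)   # total number of samples
    means = [fmean(g) for g in groups]
    grand_mean = fmean([v for g in groups for v in g])
    # Variation of the subtype means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own subtype mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(expr, labels):
    """Rank features by F-statistic, most subtype-discriminative first.

    `expr` maps a feature name to its per-sample values;
    `labels` gives the subtype of each sample in the same order.
    """
    classes = sorted(set(labels))
    scores = {}
    for feature, values in expr.items():
        groups = [[v for v, y in zip(values, labels) if y == c] for c in classes]
        scores[feature] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: six samples, two subtypes, two features.
toy_labels = ["LumA", "LumA", "LumA", "Basal", "Basal", "Basal"]
toy_expr = {
    "lnc_informative": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    "lnc_flat": [2.0, 3.0, 2.5, 2.4, 2.1, 3.1],
}
print(rank_features(toy_expr, toy_labels))
```

In a filter of this kind, features with low F-statistics (or non-significant p-values) would be dropped before the more expensive wrapper and explanation steps are run on the remainder.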

Abstract

Background: Long non-coding RNAs (lncRNAs) have emerged as important regulators in cancer biology, yet their potential for cancer subtyping remains underexplored, particularly in large-scale, multi-class supervised classification frameworks, owing to limited publicly available data or to their use only as auxiliary features in classification tasks.

Methods: In this study, we used a set of 7,177 lncRNAs obtained from 1,021 breast cancer (BRCA) transcriptomics datasets for subtyping within an explainable artificial intelligence (AI) framework. lncRNA, mRNA, and miRNA features were used to build machine learning (ML) models individually and in combination. Four ML classifiers (Naïve Bayes, Random Forest, Artificial Neural Network, and XGBoost) were employed to evaluate subtype classification performance.

Results: Using lncRNAs alone, XGBoost achieved an accuracy of 89.2% and an AUROC of 0.99. Adding miRNA or mRNA features to the lncRNA features marginally improved accuracy to 90.8% and 92.2%, respectively, while using all three feature types together provided no further gain. A sequential key feature identification pipeline (ANOVA, Boruta, SHAP) identified interpretable subtype-specific biomarker panels, yielding 119, 66, 54, and 24 unique features for the Luminal A, Luminal B, HER2+, and Basal subtypes, respectively. Further lncRNA characterization followed by survival analysis revealed significant subtype-specific novel lncRNAs, including CUFF.25255 (LumA), CUFF.20237 and CUFF.3888 (LumB), CUFF.22414 (HER2+), and CUFF.26607 and CUFF.1961 (Basal).

Conclusion: Our findings highlight the diagnostic and biomarker discovery potential of lncRNAs. The explainable-AI framework implemented here provides a systematic, large-scale evaluation of lncRNA-only and integrative models for multi-class BRCA subtyping and can be adapted to other cancers using existing cancer transcriptomics data in public databases.
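
The abstract reports counts of features that are unique to each subtype but does not say how uniqueness was determined. One plausible reading, assumed here, is that a feature is unique to a subtype when it appears in that subtype's selected panel and in no other. The sketch below illustrates that reading with set operations; the function name, feature names, and panels are hypothetical placeholders, not results from the study.

```python
def unique_features(selected):
    """Return, per subtype, the features selected for that subtype only.

    `selected` maps a subtype name to the collection of feature names
    chosen for it (for example, after a feature selection pipeline).
    """
    unique = {}
    for subtype, features in selected.items():
        # Union of every other subtype's panel.
        others = set().union(*(set(f) for s, f in selected.items() if s != subtype))
        unique[subtype] = set(features) - others
    return unique


# Hypothetical panels, for illustration only.
toy_selected = {
    "LumA": {"feat_1", "feat_2", "feat_shared"},
    "LumB": {"feat_3", "feat_shared"},
    "HER2+": {"feat_4"},
    "Basal": {"feat_5", "feat_2"},
}
for subtype, features in unique_features(toy_selected).items():
    print(subtype, sorted(features))
```

Under this definition a feature shared by any two subtypes counts toward neither, so the per-subtype counts can be much smaller than the size of each selected panel.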
