Hereditary cancers frequently arise from germline pathogenic variants, yet only a small proportion of reported variants have been clinically classified, leaving most missense variants unresolved as variants of uncertain significance (VUS). Although recent machine-learning approaches have explored disease-specific or gene-specific contexts to improve pathogenicity prediction, these models remain fundamentally limited by the scarcity of labeled data and the underutilization of abundant VUS. We propose a deep-learning framework that integrates autoencoder pretraining with a deep ensemble strategy to improve variant pathogenicity prediction, effectively leverage unlabeled VUS during pretraining, and reduce uncertainty arising from limited training samples. To validate each component of our framework, we evaluated its performance under both disease-specific and gene-specific training setups. Experiments on ClinVar variants from BRCA1, BRCA2, MLH1, and MSH2 showed that our framework achieved the best performance in the gene-specific setup for BRCA1—likely because BRCA1 contains substantially more gene-specific training data than the other genes—whereas the disease-specific setup yielded superior results for the remaining genes, which had comparatively limited gene-specific samples. Overall, our method significantly outperformed existing approaches. We also introduce an interpretability approach that provides variant-level importance profiles across pathogenicity classes, thereby enhancing transparency and clinical applicability. Moreover, by projecting feature-level importance scores into a two-dimensional space, we demonstrate that pretraining enables the model to learn distinctly different feature representations, illustrating how pretraining and ensemble learning synergistically contribute to improved predictive performance. 
Our framework preserves the specificity of disease- and gene-specific approaches, overcomes data scarcity through VUS-guided pretraining and ensembling, and offers interpretable outcomes that may be helpful for clinical decision support. Moreover, our results suggest a promising direction for pathogenicity prediction of rare missense variants and indicate that the proposed framework may be extendable to additional genes under appropriate data and modeling conditions.
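As an illustration of how a deep ensemble can yield both a prediction and an uncertainty estimate, the following minimal sketch averages class probabilities over several models and scores the result with predictive entropy. The abstract does not state how the authors aggregate ensemble members, so averaging and the entropy-based scores are assumptions made here. The function names, the two-class (benign, pathogenic) layout, and the probability values are likewise hypothetical.

```python
import math

def ensemble_mean(member_probs):
    """Average class-probability vectors across ensemble members.

    member_probs: list of per-model probability vectors for ONE variant.
    Assumption: simple averaging; the paper's aggregation rule is not given here.
    """
    n_models = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_models for c in range(n_classes)]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical outputs of three independently trained models for one variant,
# with classes ordered as (benign, pathogenic).
member_probs = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]

mean_probs = ensemble_mean(member_probs)

# Predicted class is the index with the highest averaged probability.
label = max(range(len(mean_probs)), key=mean_probs.__getitem__)

# Total predictive uncertainty of the ensemble.
total_uncertainty = entropy(mean_probs)

# Disagreement between members: entropy of the mean minus the mean entropy.
# It is non-negative because entropy is concave.
mean_member_entropy = sum(entropy(p) for p in member_probs) / len(member_probs)
disagreement = total_uncertainty - mean_member_entropy

print(mean_probs, label, total_uncertainty, disagreement)
```

In a setup like the one described, each member would presumably start from the same VUS-pretrained encoder and differ in initialization or training data, so a larger disagreement value would flag variants where limited labeled data leaves the models unsure.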
Da-Bin Lee
Hyun-Uk Kang
Kyu-Baek Hwang
BioData Mining
Soongsil University
Lee et al. studied this question.
www.synapsesocial.com/papers/69aa7037531e4c4a9ff59c1a — DOI: https://doi.org/10.1186/s13040-026-00533-5