Nonsense-mediated mRNA decay (NMD) is a critical post-transcriptional surveillance mechanism that degrades transcripts with premature termination codons, safeguarding transcriptome integrity and shaping disease phenotypes. However, accurately predicting NMD activity remains challenging, as existing models often rely on simplistic rule-based heuristics or limited feature sets, constraining their accuracy and generalizability. Using paired DNA and RNA data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression Project (GTEx), we benchmark embedding-only models and demonstrate that they underperform compared to a simple rule-based approach. To address this, we develop NMDap (NMD activity predictor), an integrative framework that combines optimized rule-based methods, sequence embeddings, and curated biological features, achieving improved predictive performance relative to simple rule-based and embedding-only models. Through explainable AI, we identify key NMD determinants, reaffirming established factors and highlighting additional associated features such as mean ribosome loading. NMDap generalizes well to independent datasets and enables large-scale mRNA degradation assessments, as demonstrated by its application to more than 2.9 million simulated stop-gain variants, advancing variant interpretation and transcriptome-informed disease research.

• NMDap improves NMD activity prediction beyond rules and embeddings alone.
• Rules, curated annotations, and embeddings jointly yield best performance.
• XGBoost imputes half-life, ribosome load, and localization from embeddings.
• SHAP highlights key NMD drivers, including ribosome loading.
• Genome-wide scoring of 2.9M stop-gain variants aids variant interpretation.
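
The abstract does not specify which rules make up the simple rule-based baseline. As an illustration only, the sketch below encodes one widely cited heuristic, the 50-nucleotide rule: a premature termination codon is predicted to trigger NMD when it lies more than about 50 nt upstream of the last exon-exon junction, and to escape NMD when it falls in the last exon or in a single-exon transcript. The function name, the argument layout, the transcript-coordinate convention, and the default threshold are assumptions made for this example and are not taken from NMDap.

```python
from typing import Sequence


def predicts_nmd_50nt_rule(
    ptc_pos: int,
    exon_lengths: Sequence[int],
    threshold: int = 50,
) -> bool:
    """Hypothetical helper: apply the 50-nt rule to one stop-gain variant.

    ptc_pos      -- 0-based position of the premature stop codon in spliced
                    transcript coordinates (assumed convention).
    exon_lengths -- exon lengths in transcript order (5' to 3').
    threshold    -- minimum distance (nt) upstream of the last exon-exon
                    junction for NMD to be predicted (assumed default: 50).
    """
    # A single-exon transcript has no exon-exon junction, so the rule
    # predicts escape from NMD.
    if len(exon_lengths) < 2:
        return False

    # The last junction sits at the start of the final exon.
    last_junction = sum(exon_lengths[:-1])

    # A stop codon inside the last exon is predicted to escape NMD.
    if ptc_pos >= last_junction:
        return False

    # Otherwise NMD is predicted only if the stop codon is far enough upstream.
    return (last_junction - ptc_pos) > threshold


if __name__ == "__main__":
    exons = [200, 150, 300]  # toy transcript; last junction at position 350
    print(predicts_nmd_50nt_rule(100, exons))  # 250 nt upstream of the junction
    print(predicts_nmd_50nt_rule(320, exons))  # 30 nt upstream of the junction
    print(predicts_nmd_50nt_rule(400, exons))  # inside the last exon
```

A binary call of this kind is the sort of heuristic the abstract contrasts with embedding-only and integrative models; in an integrative setting, its output could serve as one input feature among others rather than as the final prediction.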
Saadat et al. studied this question.