What question did this study set out to answer?

The study aims to examine how feature representation impacts the accuracy and interpretability of CRISPR off-target predictions.

April 24, 2026

Open Access

Feature representation for explainable CRISPR off-target prediction and base editing efficiency

Key Points

The study examines how feature representation affects the accuracy and interpretability of CRISPR off-target predictions.
Gene knockout and base editing were analyzed using benchmark datasets, including CHANGE-seq and GUIDE-seq.
XGBoost models were used for both classification and regression tasks.
Feature importance was evaluated through interpretability analysis with SHAP.
For gene knockout, One-Hot encoding achieved the best results on GUIDE-seq (AUPR = 0.661).
The Bulges representation performed best on CHANGE-seq for knockout tasks (AUPR = 0.612).
In base editing, One-Hot encoding was more accurate than the K-mer representation (AUPR = 0.723).
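
To make the encodings named in these key points concrete, the following is a minimal Python sketch of a position-wise one-hot encoding for an aligned gRNA/target pair, alongside a K-mer count vector. It is an illustration under assumptions, not the study's implementation: the function names (`one_hot`, `one_hot_pair`, `kmer_counts`), the A/C/G/T channel order, the scheme of 8 values per aligned position, and the choice of k = 2 are all hypothetical, and the paper's OH, OH5C, and Bulges variants may be defined differently.

```python
from itertools import product

BASES = "ACGT"  # assumed channel order; the study may use a different one


def one_hot(seq):
    """Flat list with 4 binary values per position (A, C, G, T order).

    Any character outside ACGT (for example 'N' or a gap symbol)
    becomes four zeros at that position.
    """
    return [1 if base == b else 0 for base in seq.upper() for b in BASES]


def one_hot_pair(guide, target):
    """Encode an aligned guide/target pair as 8 values per position.

    For each position, the 4 guide channels come first, followed by
    the 4 target channels. This pairing scheme is an assumption.
    """
    if len(guide) != len(target):
        raise ValueError("guide and target must be aligned to the same length")
    features = []
    for g, t in zip(guide, target):
        features.extend(one_hot(g))
        features.extend(one_hot(t))
    return features


def kmer_counts(seq, k=2):
    """Count overlapping k-mers, reported in lexicographic ACGT order."""
    vocab = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(vocab, 0)
    s = seq.upper()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        if kmer in counts:  # skip k-mers containing non-ACGT characters
            counts[kmer] += 1
    return [counts[v] for v in vocab]
```

Under this sketch, every one-hot feature maps back to a specific nucleotide position, which is what allows a per-position attribution method such as SHAP to point at regions of the sequence. A K-mer count vector, by contrast, records which short motifs occur but not where they occur.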

Abstract

Introduction: The interaction between guide RNAs (gRNAs) and target DNA sequences is a critical factor in the effectiveness of CRISPR/Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats/CRISPR-associated protein 9) gene editing. Predicting these interactions accurately requires models that offer biological insight in addition to high accuracy. This study analyzes the impact of feature representation on accuracy and interpretability in off-target prediction.

Methods: We address two CRISPR applications, gene knockout (KO) and base editing (BE), using distinct benchmark datasets. For the KO problem, we used CHANGE-seq and GUIDE-seq to evaluate paired sequence representations, while the Hanna screening dataset was used for BE. We approached the prediction problem as both a classification task and a regression task using XGBoost models.

Results: In the case of KO, there is no single universally optimal encoding. For both classification and regression, One-Hot and its variants (OH, OH5C) achieve the best results on GUIDE-seq (AUPR = 0.661, Pearson = 0.756), while the Bulges representation performs best on CHANGE-seq (AUPR = 0.612, Pearson = 0.602). In the case of BE, One-Hot encoding consistently outperforms the K-mer representation in predictive accuracy for both regression and classification (AUPR = 0.723, Pearson = 0.746).

Discussion: Our analysis demonstrates comparable predictive performance across both gene knockout and base editing tasks, confirming the robustness of the framework in distinct editing domains. Interpretability analysis using SHapley Additive exPlanations (SHAP) reveals that, despite the different mechanisms, the Protospacer Adjacent Motif (PAM)-proximal region remains a critical feature for prediction in both editing settings.
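
The abstract reports AUPR (area under the precision-recall curve) for the classification task and Pearson correlation for the regression task. As a reminder of what these two numbers measure, here is a small pure-Python sketch. It approximates AUPR by step-wise average precision, which is an assumption: the study's exact computation is not stated here. The helper names are hypothetical and not taken from the paper.

```python
import math


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 0/1 ground truth (1 = validated off-target, by assumption).
    scores: model scores, higher meaning more likely positive.
    Tied scores are ordered arbitrarily in this sketch, so it is only
    exact when all scores are distinct.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def pearson(x, y):
    """Pearson correlation between measured and predicted values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

AUPR is a common choice when positives are rare, as validated off-target sites typically are relative to candidate sites, because it ignores true negatives. A random classifier scores roughly the positive-class fraction rather than 0.5, so AUPR values should be read against that baseline and are not directly comparable across datasets with different class balance.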
