What question did this study set out to answer?

The study aims to improve gene expression analysis of single-cell RNA-seq data by imputing dropout events and performing dimensionality reduction at the same time.

January 25, 2026 (Open Access)

scIRT: Imputation and Dimensionality Reduction for Single-Cell RNA-Seq Data by Combining NMF with SMOTE

Key points

Developed scIRT, an iterative imputation pipeline combining SMOTE and NMF.
Applied scIRT to various single-cell RNA-seq datasets.
Evaluated the effectiveness of imputed matrices versus low-dimensional matrices for clustering.
scIRT demonstrated better performance compared to existing imputation methods.
Successful recovery of missing data facilitated improved cell type clustering and visualization.
The combination of SMOTE and NMF outperformed traditional methods in preserving data integrity.

Abstract

The development of single-cell RNA-sequencing (scRNA-seq) technology has made it possible to analyze genomic characteristics at single-cell resolution. Despite this rapid progress, biological experiments do not yield a complete gene expression matrix, and the resulting scRNA-seq data have a high dropout rate. Gene expression analysis and clustering tools, however, require a complete matrix of expression values for classification or clustering. Most imputation methods focus on how the imputed high-dimensional expression matrix affects clustering and do not produce a low-dimensional representation matrix, which may guide clustering even better. To this end, we designed an iterative imputation pipeline called scIRT that estimates dropout events in scRNA-seq data and performs dimensionality reduction simultaneously by combining the synthetic minority over-sampling technique (SMOTE) with non-negative matrix factorization (NMF). The adapted SMOTE effectively imputes missing data, while NMF performs dimensionality reduction and feature extraction on the high-dimensional data. Using several scRNA-seq datasets, we demonstrated that this approach achieves better and more robust performance than existing approaches. We also compared the effects of the imputed matrix and the low-dimensional representation matrix on clustering. scIRT is a tool for preprocessing scRNA-seq data: it can effectively recover missing data to facilitate downstream analyses such as cell type clustering and visualization.
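The abstract does not spell out how SMOTE and NMF are combined, so the following is only a rough sketch of the general idea, not the scIRT algorithm itself. It assumes that every zero entry is a candidate dropout, fills those entries with the SMOTE interpolation rule (a point on the segment between a cell and one of its nearest neighbour cells), and then factorizes the filled matrix with standard multiplicative-update NMF. All function names, the choice of k, the rank, and the number of rounds are hypothetical.

```python
import numpy as np

def smote_style_fill(X, missing, k=3, rng=None):
    """Fill masked entries of each cell (row) with the SMOTE interpolation
    x_new = x + lam * (neighbour - x), using one of its k nearest cells."""
    rng = np.random.default_rng(rng)
    # Pairwise Euclidean distances between cells.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a cell is not its own neighbour
    filled = X.copy()
    for i in range(X.shape[0]):
        j = rng.choice(np.argsort(d[i])[:k])  # random near neighbour
        lam = rng.uniform(0.0, 1.0)           # interpolation weight
        m = missing[i]
        filled[i, m] = X[i, m] + lam * (X[j, m] - X[i, m])
    return filled

def nmf(X, rank, n_iter=200, eps=1e-9, rng=None):
    """Multiplicative-update NMF (Frobenius loss): X is approximated by W @ H."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(0.1, 1.0, size=(X.shape[0], rank))
    H = rng.uniform(0.1, 1.0, size=(rank, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def impute_and_reduce(X, rank=2, rounds=3, k=3, seed=0):
    """Alternate SMOTE-style filling and NMF for a few rounds.
    Assumption: all zeros in X are treated as dropouts."""
    X = np.asarray(X, dtype=float)
    missing = X == 0
    current = X.copy()
    for r in range(rounds):
        filled = smote_style_fill(current, missing, k=k, rng=seed + r)
        W, H = nmf(filled, rank, rng=seed + r)
        # Observed values are kept; only masked entries take the reconstruction.
        current = np.where(missing, W @ H, X)
    return current, W  # imputed matrix and low-dimensional cell representation
```

For a cells-by-genes count matrix `X`, `imputed, W = impute_and_reduce(X, rank=2)` returns both outputs the abstract contrasts: the imputed high-dimensional matrix and the low-dimensional matrix `W`, either of which could be passed to a clustering method. Treating every zero as a dropout is a simplification, since some zeros reflect genes that are truly not expressed.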
