The establishment and development of single-cell RNA-sequencing (scRNA-seq) technology has accelerated the analysis of cellular genomic characteristics at single-cell resolution. Despite this rapid development, biological experiments still cannot yield a complete gene expression matrix, and the scRNA-seq data they produce have a high dropout rate. Unfortunately, gene expression analysis and clustering tools require a complete matrix of gene expression values for classification or clustering. Most imputation methods focus on how the imputed high-dimensional expression matrix affects clustering and do not produce a low-dimensional representation matrix, which may guide clustering even better. To this end, we designed an iterative imputation pipeline called scIRT that imputes dropout events in scRNA-seq data and performs dimensionality reduction simultaneously by combining the synthetic minority over-sampling technique (SMOTE) with non-negative matrix factorization (NMF). The adapted SMOTE imputes the missing data, while NMF performs dimensionality reduction and feature extraction on the high-dimensional data. Using several scRNA-seq datasets, we demonstrated that this new approach achieved better and more robust performance than existing approaches. We also compared the effects of the imputed matrix and the low-dimensional representation matrix on clustering. scIRT is a tool for preprocessing scRNA-seq data: it can effectively recover missing values to facilitate downstream analyses such as cell type clustering and visualization.
Mou et al. (Fri,) studied this question.
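The combination of SMOTE-style interpolation and NMF described in the abstract can be illustrated with a minimal sketch. This is not the authors' scIRT implementation. It assumes a cells-by-genes matrix in which zeros mark candidate dropouts, uses plain multiplicative-update NMF, and fills each zero by interpolating toward a randomly chosen nearest neighbour in the NMF space. The function names (`nmf`, `smote_style_impute`, `iterative_impute`) and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative X (cells x genes) as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def smote_style_impute(X, mask, W, k=3, seed=0):
    """Fill masked entries of each cell by SMOTE-like interpolation.

    Each cell i is moved a random fraction lam of the way toward one of its
    k nearest neighbours in the low-dimensional space W, but only on the
    masked (candidate dropout) genes. Observed entries are left unchanged.
    """
    rng = np.random.default_rng(seed)
    X_imp = X.astype(float).copy()
    # Pairwise Euclidean distances between cells in the NMF representation.
    dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    for i in range(X.shape[0]):
        genes = np.flatnonzero(mask[i])
        if genes.size == 0:
            continue
        j = rng.choice(neighbours[i])
        lam = rng.random()  # interpolation weight in [0, 1)
        X_imp[i, genes] = X[i, genes] + lam * (X[j, genes] - X[i, genes])
    return X_imp

def iterative_impute(X, rank=2, rounds=3, k=3):
    """Alternate NMF and SMOTE-style imputation for a fixed number of rounds."""
    mask = X == 0  # assumption: every zero is treated as a possible dropout
    X_cur = X.astype(float).copy()
    for r in range(rounds):
        W, _ = nmf(X_cur, rank, seed=r)
        X_cur = smote_style_impute(X_cur, mask, W, k=k, seed=r)
    return X_cur, W

# Toy data: Poisson counts with roughly 30% of entries zeroed out.
rng = np.random.default_rng(42)
truth = rng.poisson(5.0, size=(12, 8)).astype(float)
dropout = rng.random(truth.shape) < 0.3
X = np.where(dropout, 0.0, truth)

X_imp, W = iterative_impute(X, rank=2, rounds=3, k=3)
print(X_imp.shape, W.shape)
```

In this sketch `X_imp` plays the role of the imputed high-dimensional matrix and `W` the low-dimensional representation; either could be passed to a clustering routine, which mirrors the comparison the abstract mentions. Treating every zero as a dropout is a simplification, since some zeros reflect genuinely absent expression.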