Single-cell RNA sequencing (scRNA-seq) is entering an era of foundation models that take the complete gene atlas as input. Yet most current datasets cover only 10-12k genes and contain numerous technical zeros, which severely limits how well these models generalize to downstream tasks. To address this, we introduce the gene-completion task for scRNA-seq and present SAD, a diffusion-based framework tailored to extremely sparse data that completes missing genes and corrects sparsity bias under high missing rates. Unlike imputation or reconstruction methods that rely on the i.i.d. assumption, SAD's completion paradigm can generate gene entries that are absent from the original expression profile, detect and rectify sparsity-distribution bias, and supply foundation models with consistent, reliable inputs of more than 30k genes. Extensive benchmarks show that SAD significantly outperforms existing methods across multiple completion metrics, particularly in extreme scenarios with missing rates above 80%. SAD thus provides a data foundation for recovering and reusing missing scRNA-seq information and for precision-medicine applications. The code is available at https://github.com/ZhangLab312/SAD.
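To make the completion setting concrete, the following minimal sketch shows how a dataset measured on a subset of genes might be aligned to a full gene vocabulary, with a mask marking which genes were observed. It illustrates the problem setup only and is not the SAD method; the function name `align_to_vocabulary`, the toy gene names, and the toy counts are all hypothetical.

```python
import numpy as np

def align_to_vocabulary(expr, genes, vocab):
    """Place a (cells x measured genes) matrix into a (cells x full vocabulary) matrix.

    Hypothetical helper for illustration; not part of the SAD code base.
    Returns the aligned matrix and a boolean mask over the vocabulary that is
    True for genes present in the dataset and False for genes to be completed.
    """
    index = {g: j for j, g in enumerate(vocab)}
    full = np.zeros((expr.shape[0], len(vocab)), dtype=expr.dtype)
    observed = np.zeros(len(vocab), dtype=bool)
    for i, g in enumerate(genes):
        j = index.get(g)
        if j is not None:  # genes outside the vocabulary are ignored
            full[:, j] = expr[:, i]
            observed[j] = True
    return full, observed

# Toy example: a 10-gene vocabulary, of which the dataset measures only 3 genes.
vocab = [f"G{k}" for k in range(10)]
genes = ["G1", "G4", "G7"]
expr = np.array([[3.0, 0.0, 1.0],
                 [0.0, 2.0, 5.0]])

full, observed = align_to_vocabulary(expr, genes, vocab)

# Fraction of vocabulary genes with no measurement at all in this dataset.
missing_rate = 1.0 - observed.mean()
```

In this sketch, zeros inside observed columns may be either biological or technical zeros, whereas unobserved columns carry no measurement at all; the latter are the entries a completion model would need to generate.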
Li et al. (Thu,) studied this question.