Accurate identification of cell types from single-cell multimodal data requires integration methods that preserve biological neighborhood structure while reducing modality-specific sparsity and noise. We present sxSNF, a single-cell multimodal integration framework that first constructs modality-specific cell-cell similarity graphs, then performs iterative similarity network fusion (SNF) before self-supervised graph representation learning. This fusion-before-learning design uses the fused graph as a denoised structural prior and further refines it through a masked-edge reconstruction objective with negative sampling. On PBMC-10k and SHARE-seq benchmarks, sxSNF achieved ARI values of 0.694 and 0.589, respectively, and showed higher NMI/AMI values than the evaluated baselines under the same evaluation protocol. On the Chen-2019 SNARE-seq dataset, sxSNF recovered major cortical cell identities, marker-gene programs, and refined oligodendrocyte-lineage subpopulations supported by coordinated RNA expression and ATAC-derived gene activity. These results suggest that combining SNF-based structural denoising with graph learning can improve multimodal single-cell clustering and downstream biological interpretation. The code for sxSNF is available at https://github.com/labxscut/sxSNF.
Duan et al. studied this question.
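The fusion stage described in the abstract (modality-specific cell-cell similarity graphs followed by iterative similarity network fusion) can be illustrated with a minimal NumPy sketch. This is not the sxSNF implementation, which is available in the linked repository; it is a simplified version of the generic SNF cross-diffusion update. The function names (`gaussian_affinity`, `row_normalize`, `knn_kernel`, `snf`), the RBF kernel, the plain row normalization, the parameter values, and the synthetic two-modality data are all assumptions made for illustration. The subsequent graph learning stage with masked-edge reconstruction is not shown.

```python
import numpy as np


def gaussian_affinity(X, sigma=1.0):
    """Dense cell-cell affinity from a (cells x features) matrix via an RBF kernel."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def row_normalize(W):
    """Scale each row so that it sums to 1."""
    return W / W.sum(axis=1, keepdims=True)


def knn_kernel(W, k):
    """Keep the k largest affinities in each row (self included), then row-normalize."""
    S = np.zeros_like(W)
    top = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, top] = W[rows, top]
    return row_normalize(S)


def snf(affinities, k=5, n_iter=10):
    """Fuse two or more per-modality affinity matrices by cross-diffusion.

    Simplified sketch: each modality's global matrix P is repeatedly
    diffused through its own local kNN kernel S, using the average of
    the other modalities' P matrices as the signal being propagated.
    """
    P = [row_normalize(W) for W in affinities]
    S = [knn_kernel(W, k) for W in affinities]
    for _ in range(n_iter):
        P_next = []
        for v in range(len(P)):
            others = [P[u] for u in range(len(P)) if u != v]
            mean_others = sum(others) / len(others)
            updated = S[v] @ mean_others @ S[v].T
            updated = (updated + updated.T) / 2.0  # keep the graph symmetric
            P_next.append(row_normalize(updated))
        P = P_next
    fused = sum(P) / len(P)
    return (fused + fused.T) / 2.0


# Synthetic stand-ins for two modalities (e.g. RNA and ATAC embeddings):
# 40 cells from two groups, each modality observed with independent noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
rna = centers[labels] + rng.normal(scale=0.5, size=(40, 2))
atac = centers[labels] + rng.normal(scale=0.5, size=(40, 2))

fused = snf([gaussian_affinity(rna), gaussian_affinity(atac)], k=5, n_iter=10)
```

In this sketch, `fused` is a symmetric cell-by-cell matrix that could serve as the structural prior for a downstream graph encoder or be passed directly to a graph clustering method. The neighborhood size `k` and the iteration count `n_iter` are illustrative values only and would need tuning on real data.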