Single-cell multi-omics sequencing technologies enable the profiling of various cellular aspects, offering valuable biological insights. Integrating unpaired multi-omics data, in which different modalities are profiled in distinct cells from the same overall population, remains challenging yet crucial for a comprehensive understanding of cell states and molecular dynamics. Although numerous computational methods for integrating such unpaired data exist, a systematic evaluation of the choices at each step is lacking. The recent emergence of technologies that simultaneously profile multiple modalities within the same cell provides paired datasets, which allow for the systematic evaluation of unpaired pipelines. We leverage paired scRNA-seq with scATAC-seq and histone modification (ChIP-seq) data to systematically evaluate methods for integrating unpaired scRNA-seq and peak-based epigenomic (ATAC-seq and ChIP-seq) data, with the aim of establishing a robust, general pipeline. We benchmark individual steps, including feature linking, dimension reduction, and clustering, and evaluate their combinatorial effects on pipeline performance by testing various choices at each stage. Our findings reveal that while gene activity scores show limited correlation with gene expression, they effectively preserve cellular neighborhoods for clustering. Dimension reduction emerges as the most critical step, with non-linear methods generally offering better performance and linear methods providing robustness. Optimal transport (OT)-based label transfer consistently outperforms other strategies across various embeddings. This benchmark of unpaired integration provides valuable insights for developing methods suited for increasingly complex multi-omics study designs.
Naqing et al. (Wed,) studied this question.
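To make the idea of OT-based label transfer concrete, the following is a minimal sketch and not the implementation benchmarked in the study. It assumes that both modalities have already been embedded in a shared low-dimensional space, that the marginals are uniform (which implicitly assumes similar cell-type proportions in both datasets), and that an entropic Sinkhorn solver with a squared Euclidean cost is an acceptable stand-in for whichever OT formulation the study uses. The function names `sinkhorn_plan` and `ot_label_transfer` and the parameter `reg` are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.05, n_iter=500):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    `reg` is an assumed regularization strength, scaled by the largest cost.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on source cells
    b = np.full(m, 1.0 / m)  # uniform mass on target cells
    K = np.exp(-cost / (reg * cost.max()))  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    # Transport plan: entry (i, j) is the mass moved from source i to target j.
    return u[:, None] * K * v[None, :]


def ot_label_transfer(src_emb, src_labels, tgt_emb, reg=0.05):
    """Transfer labels from annotated source cells to unlabeled target cells.

    src_emb: (n_src, d) embedding of labeled cells (e.g. scRNA-seq).
    tgt_emb: (n_tgt, d) embedding of unlabeled cells (e.g. scATAC-seq).
    Both are assumed to live in the same embedding space.
    """
    classes, src_idx = np.unique(src_labels, return_inverse=True)
    onehot = np.eye(len(classes))[src_idx]  # (n_src, n_classes)

    # Pairwise squared Euclidean cost between source and target cells.
    diff = src_emb[:, None, :] - tgt_emb[None, :, :]
    cost = (diff ** 2).sum(axis=-1)

    plan = sinkhorn_plan(cost, reg=reg)

    # Each target cell receives label mass in proportion to the transport
    # plan; the predicted label is the class with the most transported mass.
    scores = plan.T @ onehot  # (n_tgt, n_classes)
    return classes[scores.argmax(axis=1)], scores
```

In this sketch, `ot_label_transfer(rna_embedding, rna_cell_types, atac_embedding)` would return one predicted label per target cell together with the per-class transported mass, which could serve as a rough confidence score. Because the marginals are uniform, strongly mismatched cell-type proportions between the two datasets would force mass across cell types; an unbalanced OT variant would be a natural alternative in that case.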