What question did this study set out to answer?

This research aims to develop a robust post-integrated inference method that mitigates the biases in multiple hypothesis testing across heterogeneous datasets.

February 6, 2026

Assumption-Lean Post-Integrated Inference with Surrogate-Control Outcomes

Key Points

Developed a post-integrated inference method using control outcomes.
Utilized negative-control outcomes for nonparametric identifiability of direct effects.
Introduced surrogate-control outcomes to enhance inference on direct-effect estimands.
Implemented finite-sample linear expansions and bias quantification methodologies.
Evaluated the method with simulations and analysis of single-cell CRISPR-perturbed datasets.
The proposed doubly robust estimators are consistent and efficient under minimal assumptions.
Bias quantifications demonstrate the robustness of the approach against model misspecification.
Simulations showed effective data-adaptive estimation using random forests.

Abstract

Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variation, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by leveraging control outcomes. Using causal interpretations, we derive nonparametric identifiability of direct effects via negative-control outcomes. By utilizing surrogate-control outcomes as an extension of negative-control outcomes, we develop semiparametric inference on projected direct-effect estimands, accounting for hidden mediators, confounders and moderators. These estimands remain statistically meaningful under model misspecification and in the presence of error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation using machine learning algorithms. We evaluate our approach with random forests through simulations and the analysis of single-cell CRISPR-perturbed datasets, which may contain unmeasured confounders.
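To make the term "doubly robust" concrete, the sketch below implements the generic textbook augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect. It is not the paper's estimator: it ignores latent embeddings, surrogate-control outcomes, and cross-fitting, and it assumes a single observed discrete covariate. The nuisance functions are fitted by stratum means rather than random forests, and all names (`fit_stratified_nuisances`, `aipw_ate`, the toy data) are hypothetical.

```python
from collections import defaultdict


def fit_stratified_nuisances(x, a, y):
    """Fit m(a, x) = E[Y | A=a, X=x] and e(x) = P(A=1 | X=x) by stratum means.

    Assumes x is discrete, a is 0/1, and every stratum of x contains both arms.
    """
    count = defaultdict(lambda: [0, 0])      # count[x][a] = number of units
    total = defaultdict(lambda: [0.0, 0.0])  # total[x][a] = sum of outcomes
    for xi, ai, yi in zip(x, a, y):
        count[xi][ai] += 1
        total[xi][ai] += yi

    def outcome(ai, xi):
        return total[xi][ai] / count[xi][ai]

    def propensity(xi):
        return count[xi][1] / (count[xi][0] + count[xi][1])

    return outcome, propensity


def aipw_ate(x, a, y, outcome, propensity):
    """Average the AIPW score: plug-in contrast plus weighted residual corrections."""
    scores = []
    for xi, ai, yi in zip(x, a, y):
        m1, m0, e = outcome(1, xi), outcome(0, xi), propensity(xi)
        correction = ai * (yi - m1) / e - (1 - ai) * (yi - m0) / (1 - e)
        scores.append(m1 - m0 + correction)
    return sum(scores) / len(scores)


# Toy data (illustrative only): x confounds treatment and outcome, true effect is 2.
x = [0, 0, 0, 0, 1, 1, 1, 1]
a = [1, 0, 0, 0, 1, 1, 1, 0]
y = [2.0 * ai + 3.0 * xi for xi, ai in zip(x, a)]

treated = [yi for ai, yi in zip(a, y) if ai == 1]
control = [yi for ai, yi in zip(a, y) if ai == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)

outcome, propensity = fit_stratified_nuisances(x, a, y)
both_right = aipw_ate(x, a, y, outcome, propensity)
# Deliberately wrong outcome model (always 0), correct propensity.
wrong_outcome = aipw_ate(x, a, y, lambda ai, xi: 0.0, propensity)
# Correct outcome model, deliberately wrong propensity (always 0.5).
wrong_propensity = aipw_ate(x, a, y, outcome, lambda xi: 0.5)
```

On this noise-free toy dataset the naive difference in means is 3.5, while all three AIPW estimates recover the true effect of 2 up to floating-point rounding. The estimate stays correct when either nuisance model is wrong, provided the other is right, which is the property that makes flexible learners such as random forests usable for the nuisance fits.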
