Summary
Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variation, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference method that accounts for latent heterogeneity by leveraging control outcomes. Using causal interpretations, we derive nonparametric identifiability of direct effects via negative-control outcomes. By utilizing surrogate-control outcomes as an extension of negative-control outcomes, we develop semiparametric inference on projected direct-effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecification and in the presence of error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation using machine learning algorithms. We evaluate our approach with random forests through simulations and the analysis of single-cell CRISPR-perturbed datasets, which may contain unmeasured confounders.
Du et al. studied this question.