Abstract
We consider the problem of learning optimal treatment policies from observational data. We propose an algorithm that combines doubly robust welfare estimation, to accommodate rich covariates and unknown propensity scores, and sample splitting, to adaptively select policy complexity. We show that the resulting treatment rule achieves the minimax-optimal rate of convergence in expected regret while selecting a suitable policy complexity with nearly oracle performance. Our analysis avoids unnecessarily restrictive assumptions commonly imposed on the data-generating process or on first-stage nonparametric estimators and yields a sharp characterization of the relevant universal constants. The practical performance of the proposed method is demonstrated in a simulation study.
Ponomarev et al. (Fri,) studied this question.
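The two ingredients named in the abstract, doubly robust welfare estimation and sample splitting, can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it uses the standard augmented inverse-propensity-weighted (AIPW) score, assumes the nuisance estimates `m1_hat`, `m0_hat` (outcome regressions) and `e_hat` (propensity score) are supplied by some first-stage fit, and restricts policies to thresholds on one scalar covariate. All function and variable names are hypothetical.

```python
import numpy as np

def dr_scores(y, d, m1_hat, m0_hat, e_hat):
    """Standard AIPW (doubly robust) score for each unit's treatment effect.

    m1_hat, m0_hat, e_hat are assumed to be first-stage estimates of
    E[Y | X, D=1], E[Y | X, D=0] and P(D=1 | X); they are inputs here.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    return (m1_hat - m0_hat
            + d * (y - m1_hat) / e_hat
            - (1.0 - d) * (y - m0_hat) / (1.0 - e_hat))

def best_threshold(x, gamma, grid):
    """Pick the rule 1{x > t} that maximizes the estimated welfare gain
    (relative to treating no one) over a hypothetical grid of thresholds."""
    x = np.asarray(x, dtype=float)
    gains = [float(np.mean(gamma * (x > t))) for t in grid]
    k = int(np.argmax(gains))
    return grid[k], gains[k]

def select_on_holdout(policies, x_hold, gamma_hold):
    """Sample-splitting step: among candidate policies fitted on one fold,
    return the index of the one with the highest hold-out welfare gain."""
    values = [float(np.mean(gamma_hold * pi(x_hold))) for pi in policies]
    return int(np.argmax(values)), values
```

In this sketch, candidate policies of different complexity would each be fitted on a training fold with `best_threshold` (or a richer analogue), and `select_on_holdout` would then compare them on scores computed from a separate fold, so that the complexity choice is not made on the same data used for fitting.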