Using machine learning predictions as proxies for difficult-to-observe outcome variables can bias empirical estimates when prediction errors correlate with treatment variables. We describe methods for detecting and correcting these biases using a sample of ground truth data. Such ground truth data are often unavailable in practice, however. We construct a novel dataset on deforestation in Africa using approximately optimal sampling methods and visual interpretation of high-resolution satellite imagery. We use the data to evaluate bias in widely used satellite-derived measures of deforestation. We find that deforestation is systematically under-predicted in areas with higher rates of deforestation.
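The bias mechanism described above can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes a simple linear regression with one treatment variable, and the data are synthetic, generated only for illustration. The function names (`ols_slope`, `proxy_bias_check`) are hypothetical. In a ground truth sample where both the proxy and the true outcome are observed, regressing the prediction error on the treatment shows whether the two are correlated, and the resulting slope equals the gap between the proxy-based and truth-based estimates.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def proxy_bias_check(treatment, proxy, truth):
    """Compare slopes in a sample where proxy and truth are both observed.

    Returns the slope using the proxy outcome, the slope using the true
    outcome, and the slope of the prediction error (proxy - truth) on
    the treatment. By linearity of OLS, proxy slope = truth slope + error slope.
    """
    proxy = np.asarray(proxy, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return {
        "proxy": ols_slope(treatment, proxy),
        "truth": ols_slope(treatment, truth),
        "error": ols_slope(treatment, proxy - truth),
    }


# Synthetic illustration (assumed data, not from the paper).
rng = np.random.default_rng(0)
n = 2000
treatment = rng.normal(size=n)
truth = 1.0 * treatment + rng.normal(size=n)
# The proxy's error falls as the treatment rises, so it is correlated with treatment.
proxy = truth - 0.3 * treatment + rng.normal(scale=0.5, size=n)

result = proxy_bias_check(treatment, proxy, truth)
print(result)
```

A nonzero error slope signals that the proxy-based estimate is biased. In this simplified setting, subtracting the error slope from the proxy slope recovers the truth-based slope within the ground truth sample.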
Gordon et al. (Fri,) studied this question.