We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is only tied to the relatively small number of positive instances, which justifies the usage of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance. To further improve the estimation efficiency over the IPW method, we propose a likelihood-based estimator by correcting log odds for the sampled data and prove that the improved estimator has the smallest asymptotic variance among a large class of estimators. It is also more robust to pilot misspecification. We validate our approach on simulated data as well as a real click-through rate dataset with more than 0.3 trillion instances, collected over a period of a month. Both theoretical and empirical results demonstrate the effectiveness of our method.
Wang et al. studied this question.
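To make the two estimators named in the abstract concrete, the following is a minimal sketch on simulated data, not the authors' implementation. It keeps every positive instance, keeps each negative with a nonuniform probability `pi`, and then fits a logistic model in two ways: with inverse probability weights, and with a log odds correction applied as a fixed offset of `-log(pi)` on the sampled negatives and positives alike. The helper `fit_logistic`, the sampling rule for `pi`, and all simulation settings are assumptions chosen for illustration; in particular, `pi` here is not the optimal sampling probability derived in the paper.

```python
import numpy as np


def fit_logistic(X, y, weights=None, offset=None, n_iter=100, tol=1e-8):
    """Hypothetical helper: weighted logistic regression with a fixed offset,
    fitted by Newton's method with step halving."""
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    off = np.zeros(n) if offset is None else offset

    def nll(b):
        # Weighted negative log-likelihood of the logistic model.
        eta = X @ b + off
        return np.sum(w * (np.logaddexp(0.0, eta) - y * eta))

    beta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ beta + off
        p = 0.5 * (1.0 + np.tanh(0.5 * eta))  # numerically stable sigmoid
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        # Halve the step until the objective no longer increases.
        t, current = 1.0, nll(beta)
        while nll(beta + t * step) > current and t > 1e-8:
            t *= 0.5
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta


# Simulated imbalanced data (assumed settings): roughly 1% positives.
rng = np.random.default_rng(0)
n = 200_000
beta_true = np.array([-5.0, 1.0])
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta_true)))
y = (rng.random(n) < p_true).astype(float)

# Nonuniform negative sampling: keep all positives, keep a negative with
# probability pi(x). This rule is illustrative only.
pi = np.clip(0.02 * np.exp(0.5 * x), 1e-3, 1.0)
keep = (y == 1) | (rng.random(n) < pi)
Xs, ys, pis = X[keep], y[keep], pi[keep]

# IPW estimator: weight 1 for positives, 1 / pi(x) for sampled negatives.
ipw_weights = np.where(ys == 1, 1.0, 1.0 / pis)
beta_ipw = fit_logistic(Xs, ys, weights=ipw_weights)

# Log odds correction: in the subsample the log odds equal
# x' beta - log(pi(x)), so fit an unweighted model with that offset.
beta_loc = fit_logistic(Xs, ys, offset=-np.log(pis))

print("kept fraction:", keep.mean())
print("IPW estimate:", beta_ipw)
print("log odds corrected estimate:", beta_loc)
```

Both fits target the full-data parameters while using only a small fraction of the negatives. The offset follows from Bayes' rule: conditional on an instance being kept, the odds of a positive are the original odds divided by `pi`.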