We address the problem of learning in an online bandit setting where the learner must repeatedly select among $K$ actions but receives only partial feedback based on its choices. We establish two new facts. First, using a new algorithm called Exp4.P, we show that it is possible to compete with the best in a set of $N$ experts with probability $1-\delta$ while incurring regret at most $O(\sqrt{KT \ln(N/\delta)})$ over $T$ time steps. The new algorithm is tested empirically on a large-scale, real-world dataset. Second, we give a new algorithm called VE that competes with a possibly infinite set of policies of VC-dimension $d$ while incurring regret at most $O(\sqrt{T(d \ln(T) + \ln(1/\delta))})$ with probability $1-\delta$. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing supervised-learning-type guarantees for the contextual bandit setting.
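To get a feel for how the two guarantees scale, the sketch below evaluates the expressions inside the big-O, reading them as sqrt(K T ln(N/δ)) for Exp4.P and sqrt(T (d ln T + ln(1/δ))) for VE. The hidden constants are not given in the abstract, so they are simply dropped here; the function names and the sample values of K, N, d, and δ are illustrative assumptions, not figures from the paper.

```python
import math


def exp4p_bound(K: int, T: int, N: float, delta: float) -> float:
    """Exp4.P regret expression sqrt(K * T * ln(N / delta)), constant factor omitted."""
    return math.sqrt(K * T * math.log(N / delta))


def ve_bound(T: int, d: int, delta: float) -> float:
    """VE regret expression sqrt(T * (d * ln(T) + ln(1 / delta))), constant factor omitted."""
    return math.sqrt(T * (d * math.log(T) + math.log(1.0 / delta)))


if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    K, N, d, delta = 10, 1000, 20, 0.05
    for T in (10**3, 10**5, 10**7):
        e = exp4p_bound(K, T, N, delta)
        v = ve_bound(T, d, delta)
        # Dividing by T gives the average per-round regret, which shrinks as T grows.
        print(f"T={T:>8}  Exp4.P~{e:10.1f} ({e / T:.4f}/round)  VE~{v:10.1f} ({v / T:.4f}/round)")
```

Both expressions grow roughly like the square root of T, so the average regret per round tends to zero, which is the sense in which the guarantees resemble those of supervised learning.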
Beygelzimer et al. studied this question.