What type of study is this?

This is a Quantitative Study study.

October 1, 2025Open Access

Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems

Key Points

Follow-the-Perturbed-Leader policy achieves near optimal regret bound, improving efficiency in arm selection.
The regret bound for FTPL with Fréchet perturbation is shown to be O(sqrt(nm)(sqrt(d log(d)) + m^(5/6))).
Lower bounds indicate extra factors in the approach are unavoidable, necessitating innovative methods for improvement.
The work establishes best-of-both-world regret bounds, balancing adversarial and stochastic settings effectively.

Abstract

We consider a common case of the combinatorial semi-bandit problem, the m-set semi-bandit, where the learner exactly selects m arms from the total d arms. In the adversarial setting, the best regret bound, known to be O (nmd) for time horizon n, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires to explicitly compute the arm-selection probabilities via optimizing problems at each time step and sample according to them. This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the m arms that rank among the m smallest (estimated) loss with random perturbation. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the near optimal regret bound O (nm (d (d) +m^5/6) ) in the adversarial setting and approaches best-of-both-world regret bounds, i. e. , achieves a logarithmic regret for the stochastic setting. Moreover, our lower bounds show that the extra factors are unavoidable with our approach; any improvement would require a fundamentally different and more challenging method.

Read Full Paperexternally

Mark Helpful

Bookmark

Relay

View Full Paper