What does this research mean for the field?

A threshold-based online learning algorithm can achieve an optimal 1/2-competitive ratio with sublinear regret for allocating reusable resources under stochastic, unknown rewards and durations, matching the fundamental performance limit for non-anticipatory algorithms. Novelty: ClaimNovelty.METHODOLOGICAL. Consensus alignment: ConsensusAlignment.NEUTRAL.

What question did this study set out to answer?

This research aims to develop algorithms for optimal resource allocation under uncertain task rewards and durations.

May 31, 2026

Learning to Allocate Reusable Resources under Stochastic Rewards and Durations

Key Points

This research aims to develop algorithms for optimal resource allocation under uncertain task rewards and durations.
Established offline benchmark for resource allocation using a prophet approach.
Derived an infinite-dimensional linear program to evaluate the expected reward.
Developed online algorithms incorporating a profitability threshold for task acceptance.
The proposed algorithm achieves a 1/2-approximate regret of O(√T) in known reward settings.
In unknown reward settings, the algorithm maintains a 1/2-competitive ratio with O(T^2/3) regret.
Introducing a termination mechanism recovers similar levels of approximate regret in scenarios with unobservable durations.

Abstract

We study the problem of sequentially allocating a reusable resource to stochastically arriving tasks whose rewards and durations are both random with unknown statistics. Upon each observable arrival (i.e., when the resource is idle), the task type and duration are revealed, and the decision-maker must immediately choose whether to accept or reject the task; noisy reward feedback is revealed only after making the decision. Arrivals that occur while the resource is busy are not observed. This setting captures practical admission-control problems with reusable resources, such as cloud job scheduling, crowdsourcing, and other partially observable service systems in which accepting a task temporarily makes a server or worker unavailable. We first establish a fundamental limit for the full-information offline (prophet) benchmark: no non-anticipatory online algorithm can guarantee more than a (1/2) fraction of ( E OPT ( T )), even when the reward and duration distributions are known. This impossibility motivates our use of (1/2)-approximate regret, namely the additive gap to (1 over 2 E OPT ( T )). To develop online algorithms, we first derive an infinite-dimensional linear program (LP) through a steady-state analysis, establishing an upper bound on the expected reward of offline optimum. The structure of this LP motivates a simple and effective threshold-based policy: a task is accepted only if its ''profitability'', defined as the ratio of its reward to its duration plus the expected waiting time, exceeds a specific threshold. Motivated by this insight, we develop online algorithms that estimate the profitability threshold from observed task arrivals. As a warm-up, we first propose an algorithm for the setting where the reward functions are known, and show that it achieves an 1/2-approximate regret of O (√ T ). We then extend the algorithm to the fully online learning setting with unknown reward functions, where we incorporate a nonparametric estimation procedure to learn the reward functions. The resulting algorithm preserves the 1/2-competitive ratio and attains an 1/2-approximate regret of O ( T 2/3 ). Finally, we show that when task durations are also unobservable upon arrival and are revealed only after task completion, the problem fundamentally degenerates and no meaningful competitive guarantee is achievable. We demonstrate that by augmenting the model with a termination mechanism (allowing interrupting an ongoing task at the cost of losing its reward), the same order of approximate regret can be recovered.

Mark Helpful

Bookmark

Relay