We study the problem of sequentially allocating a reusable resource to stochastically arriving tasks whose rewards and durations are both random with unknown statistics. Upon each observable arrival (i.e., when the resource is idle), the task type and duration are revealed, and the decision-maker must immediately choose whether to accept or reject the task; noisy reward feedback is revealed only after making the decision. Arrivals that occur while the resource is busy are not observed. This setting captures practical admission-control problems with reusable resources, such as cloud job scheduling, crowdsourcing, and other partially observable service systems in which accepting a task temporarily makes a server or worker unavailable. We first establish a fundamental limit for the full-information offline (prophet) benchmark: no non-anticipatory online algorithm can guarantee more than a (1/2) fraction of ( E OPT ( T )), even when the reward and duration distributions are known. This impossibility motivates our use of (1/2)-approximate regret, namely the additive gap to (1 over 2 E OPT ( T )). To develop online algorithms, we first derive an infinite-dimensional linear program (LP) through a steady-state analysis, establishing an upper bound on the expected reward of offline optimum. The structure of this LP motivates a simple and effective threshold-based policy: a task is accepted only if its ''profitability'', defined as the ratio of its reward to its duration plus the expected waiting time, exceeds a specific threshold. Motivated by this insight, we develop online algorithms that estimate the profitability threshold from observed task arrivals. As a warm-up, we first propose an algorithm for the setting where the reward functions are known, and show that it achieves an 1/2-approximate regret of O (√ T ). We then extend the algorithm to the fully online learning setting with unknown reward functions, where we incorporate a nonparametric estimation procedure to learn the reward functions. The resulting algorithm preserves the 1/2-competitive ratio and attains an 1/2-approximate regret of O ( T 2/3 ). Finally, we show that when task durations are also unobservable upon arrival and are revealed only after task completion, the problem fundamentally degenerates and no meaningful competitive guarantee is achievable. We demonstrate that by augmenting the model with a termination mechanism (allowing interrupting an ongoing task at the cost of losing its reward), the same order of approximate regret can be recovered.
Joshi et al. (Fri,) studied this question.