Many modern scientific workloads in HPC centers rely heavily on AI-driven tasks, particularly deep neural network (DNN) training. Managing and scheduling these workloads efficiently through SLURM interfaces requires users to understand the available resources, the allocation policies, and the execution configurations that suit their models' estimated resource requirements and constraints. In practice, users typically submit jobs with default configurations, adjust them as needed, or request the maximum available limits to ensure uninterrupted execution. However, this approach can lead to job interruptions from underprovisioning and to prolonged wait times, inefficient resource utilization, and increased costs from overprovisioning. These issues ultimately degrade cluster performance, which motivates a more efficient solution such as an AI-enabled scheduler framework that profiles DNN workloads and then estimates and provisions resources dynamically. Existing resource estimation models are trained independently to predict individual aspects of batch processing and scheduling, and they do not work cohesively to orchestrate a job's execution. In this work, we investigate the feasibility of iScheduler, a framework that transforms the traditional SLURM resource provisioning workflow into an AI-enabled scheduler. iScheduler plugs in different estimators where needed to orchestrate the workflow: it generates a cyberinfrastructure-aware execution plan, then schedules and monitors jobs until completion. We demonstrate the feasibility of the framework by orchestrating a user-specific DNN training workload.
Vallabhajosyula et al. (Wed,) studied this question.