March 22, 2024

DeepCTS: A Deep Reinforcement Learning Approach for AI Container Task Scheduling

Abstract

Container technology is a virtualization paradigm that has developed rapidly in recent years. In container clusters for AI workloads, tasks can be divided into training tasks and inference tasks. Training tasks have low sensitivity to scheduling delay and long execution times, whereas inference tasks have high sensitivity to scheduling delay and short execution times. However, prior work has rarely considered these characteristics, leading to excessive completion latency for latency-sensitive inference tasks and insufficient resources for resource-sensitive training tasks. Ignoring the characteristics of AI tasks also leaves resource usage imbalanced both within nodes and among nodes in container clusters. In addition, an online learning algorithm is needed to make decisions adaptively as task arrivals change. This paper therefore proposes DeepCTS, a reinforcement-learning-based resource scheduling algorithm for AI container clouds that accounts for the different latency sensitivities of training and inference tasks. DeepCTS takes the resource usage status of cluster nodes and the characteristics of container tasks as input, and treats task types according to their delay sensitivity in order to reduce task waiting time and balance cluster resource load. Through action mask filtering, it guides the reinforcement learning agent to prioritize inference tasks while interacting with the environment, which reduces the waiting time of inference tasks while making full use of idle cluster resources when task resource requests are low. Compared with the existing scheduling algorithm, experimental results show that the average task waiting time is reduced by 35.1%, and that resource imbalance among nodes and within nodes is improved by 14.2% and 1.4%, respectively.
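
The action mask filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual masking rule, state encoding, or policy network of DeepCTS, so everything below is an assumption made for illustration: the `Task` fields, the single-node `free_cpu`/`free_mem` view, and the rule "hide training tasks while a schedulable inference task is waiting" are hypothetical, and `masked_softmax` stands in for whatever policy head the paper uses.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    kind: str   # "inference" or "training" (hypothetical labels)
    cpu: float  # requested CPU cores
    mem: float  # requested memory in GiB


def action_mask(tasks, free_cpu, free_mem):
    """Return one boolean per pending task (True = the agent may pick it)."""
    # A task is feasible only if its request fits the free resources.
    fits = [t.cpu <= free_cpu and t.mem <= free_mem for t in tasks]
    inference_ready = any(
        f and t.kind == "inference" for f, t in zip(fits, tasks)
    )
    if inference_ready:
        # Assumed rule: hide training tasks while a feasible inference task waits.
        return [f and t.kind == "inference" for f, t in zip(fits, tasks)]
    # No inference task can run now, so training tasks may use idle resources.
    return fits


def masked_softmax(logits, mask):
    """Softmax over allowed actions; masked actions get probability 0."""
    allowed = [x for x, m in zip(logits, mask) if m]
    if not allowed:
        return [0.0] * len(logits)
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: three pending tasks, one node with 4 free cores and 16 GiB free.
pending = [Task("training", 4, 16), Task("inference", 1, 2), Task("inference", 8, 4)]
mask = action_mask(pending, free_cpu=4, free_mem=16)
probs = masked_softmax([2.0, 0.5, 1.0], mask)
```

In this sketch the mask is applied to the policy logits before sampling, so the agent never selects an infeasible or deprioritized action. When no inference task fits, the mask falls back to all feasible tasks, which mirrors the abstract's point about using idle resources.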
