Modern supercomputers and cloud providers rely on server nodes equipped with multiple CPU sockets and general-purpose GPUs (GPGPUs) to handle the high demand for intensive computations. These servers consume much more power than commodity servers, and integrating them with the power capping systems used in modern clusters presents new challenges. In this paper, we propose a new power capping controller, PowerCoord, that is specifically designed for servers with multiple CPU and GPU sockets that run multiple jobs at a time. PowerCoord coordinates among the various power domains (e.g., CPU sockets and GPUs) inside a server to meet target power caps while seeking to maximize throughput. Our approach also takes job deadlines and priorities into consideration. Because performance modeling for co-located jobs is error-prone, PowerCoord uses a learning method to adapt to various workloads. PowerCoord has a number of heuristic policies for allocating power among the various CPUs and GPUs, and it uses reinforcement learning to select the best policy at runtime. Based on the observed state of the system, PowerCoord shifts the distribution of selected policies. We implement our power cap controller on a real multi-CPU/GPU server with low overhead, and we demonstrate that it meets target power caps while maximizing throughput and balancing other demands such as priorities and deadlines. Compared to prior published techniques, our results show that PowerCoord improves throughput by an average of 14.4% under power caps.
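The policy-selection idea in the abstract can be illustrated with a small sketch. The abstract does not specify the learning algorithm, the state encoding, the reward, or the heuristic policies, so everything below is an assumption made for illustration: the two allocation heuristics (`equal_share`, `proportional_to_demand`), the `PolicySelector` class, and the softmax preference update (a gradient-bandit style rule) are hypothetical stand-ins, not PowerCoord's actual design.

```python
import math
import random
from collections import defaultdict

# Hypothetical heuristics: each splits a server-level cap (watts) across
# power domains, given each domain's current demand (watts).
def equal_share(cap, demand):
    share = cap / len(demand)
    return {domain: share for domain in demand}

def proportional_to_demand(cap, demand):
    total = sum(demand.values())
    if total == 0:
        return equal_share(cap, demand)
    return {domain: cap * watts / total for domain, watts in demand.items()}

POLICIES = [equal_share, proportional_to_demand]

class PolicySelector:
    """Keeps, for each observed state, a softmax distribution over policies.

    Assumption: a gradient-bandit style update stands in for the paper's
    unspecified reinforcement learning method.
    """

    def __init__(self, n_policies, learning_rate=0.1, seed=0):
        self.n = n_policies
        self.lr = learning_rate
        self.rng = random.Random(seed)
        self.prefs = defaultdict(lambda: [0.0] * n_policies)
        self.baseline = defaultdict(float)  # running mean reward per state
        self.count = defaultdict(int)

    def distribution(self, state):
        prefs = self.prefs[state]
        top = max(prefs)  # subtract the max for numerical stability
        exps = [math.exp(p - top) for p in prefs]
        norm = sum(exps)
        return [e / norm for e in exps]

    def choose(self, state):
        probs = self.distribution(state)
        return self.rng.choices(range(self.n), weights=probs)[0]

    def update(self, state, action, reward):
        probs = self.distribution(state)
        self.count[state] += 1
        self.baseline[state] += (reward - self.baseline[state]) / self.count[state]
        advantage = reward - self.baseline[state]
        for a in range(self.n):
            grad = (1.0 - probs[a]) if a == action else -probs[a]
            self.prefs[state][a] += self.lr * advantage * grad
```

In a control loop, one would call `choose(state)` each interval, apply `POLICIES[action](cap, demand)` to set per-domain caps, measure a reward such as throughput under the cap, and pass it to `update`. Over time the distribution for each state shifts toward the heuristics that earned higher reward there.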
Reza Azimi
Chao Jing
Sherief Reda
Brown University
Guilin University of Technology
Azimi et al. studied this question.
www.synapsesocial.com/papers/6a16e18a0f965e9c137bb432 — DOI: https://doi.org/10.1109/igcc.2018.8752132