Data centers must maintain service stability while satisfying power, energy-efficiency, and carbon constraints; single-layer control can cause cross-subsystem oscillation under power caps and demand-response events. We propose a collaborative regulation system that combines workload orchestration, thermal-aware cooling control, and grid-aware power budgeting in a hierarchical, multi-timescale controller. The coordination problem is solved by constrained optimization with distributed consensus, and a safety guard tightens the constraints so that thermal and SLO feasibility is guaranteed even in the presence of prediction errors. On production-like traces and a high-fidelity digital twin, our method moves operating points toward the Pareto frontier, decreasing SLA violations and tail-latency spikes and reducing PUE and the cooling share of power; these improvements are statistically significant at the 95% confidence level. We also report system and software requirements, as well as scalability and generalizability across sites. Cross-layer cooperation offers a practical way to remain stable when energy supply is constrained, and it improves resilience and sustainability. Overall, cooperative regulation recovers faster, overshoots less, and behaves more predictably than local-only methods.
Yaqi Hou (Thu,) studied this question.
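The safety guard mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Guard` and `scale_budgets`, the use of inlet temperature as the guarded quantity, and the proportional scaling rule are all assumptions made for illustration. The idea shown is only that tightening a limit by an assumed worst-case prediction error keeps the true value within the original limit whenever the prediction passes the tightened check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Guard:
    """Hypothetical safety guard: tightens a hard limit by a prediction-error margin."""
    thermal_limit_c: float    # hard inlet-temperature limit (assumed quantity)
    forecast_error_c: float   # assumed worst-case prediction error

    def tightened_limit(self) -> float:
        # If |actual - predicted| <= forecast_error_c, then
        # predicted <= limit - error implies actual <= limit.
        return self.thermal_limit_c - self.forecast_error_c

    def is_safe(self, predicted_temp_c: float) -> bool:
        return predicted_temp_c <= self.tightened_limit()


def scale_budgets(requests_kw: List[float], cap_kw: float) -> List[float]:
    """Scale per-rack power requests proportionally so their sum fits the cap.

    A stand-in for the constrained optimization step; the paper's actual
    allocation rule is not specified in the abstract.
    """
    total = sum(requests_kw)
    if total <= cap_kw:
        return list(requests_kw)
    factor = cap_kw / total
    return [r * factor for r in requests_kw]


guard = Guard(thermal_limit_c=27.0, forecast_error_c=1.5)
budgets = scale_budgets([40.0, 60.0], cap_kw=50.0)
```

Here a prediction of 26.0 °C would be rejected even though it is below the 27.0 °C hard limit, because the 1.5 °C margin leaves no room for forecast error; this conservatism is the price of guaranteed feasibility.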