GPU clusters are increasingly important in high-performance computing environments for large-scale simulation, data analytics, and deep learning workloads.Meanwhile, Kubernetes has moved beyond cloudnative service orchestration and is increasingly discussed in research and practice as a platform for scientific computing, largely due to its declarative control paradigm, interoperability and complete ecosystem.The key issue here is to balance Kubernetes flexibility and the strict efficiency requirements of GPU-based HPC systems, where the latency sensitivity, topology information, accelerator usage, and finegrained observability have a strong impact on scientific throughput.This review examines peer-reviewed literature on GPU resource management, container orchestration, telemetry, and monitoring-driven optimization, with particular attention to Kubernetes-based implementations and custom monitoring agents.Themes such as scheduling under heterogeneous accelerator constraints, scientific workload container overhead, node-level and pod-level observability of GPUs, fairness and isolation, and feedback control and performance and energy optimization are major themes.It has been reported that orchestration per se can rarely provide maximum efficiency; quantifiable benefits are more often associated with scheduler extensions, topology-aware placement, and monitoring pipelines which are able to reveal the pressure on memory, streaming multiprocessor occupancy, I/O contention, and thermal or power behaviours.Persistent gaps include the lack of cross-layer metrics, limited support for multi-tenant GPU fragmentation, and insufficient validation at production-scale HPC environments.The discipline is of great importance owing to the fact that the exascale and AI-driven next-generation systems will need to be operating models that combine portability, policy control, and accelerator-aware observability.
Ratan Raj Anandeshi (Thu,) studied this question.