What question did this study set out to answer?

This work aims to improve the deployment of AI models in hybrid cloud-edge environments while ensuring low latency and high throughput.

June 21, 2026 · Open Access

Learning-based orchestration for low-latency AI deployment in hybrid cloud–edge platforms

Key Points

Introduced a deep learning-based scheduling system for deploying AI models on cloud edges.
Utilized real-time system telemetry and features derived from MLPerf Inference Benchmark.
Tested four MLPerf inference workloads across different batch sizes and latency thresholds.
Achieved deployment success rates above 90% for unseen models such as GPT-2 and YOLOv5.
Demonstrated latency reduction and throughput improvements.
The effectiveness of the solution depends on how representative the profiling data are and on how similar the training and deployment environments are.
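The success-rate figure in the key points can be made concrete with a short sketch. The summary does not define "deployment success", so the code below assumes it means the measured latency stayed within the latency threshold for that run. The `Deployment` record, the `success_rate` helper, and all numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    model: str
    latency_ms: float    # measured inference latency for this run
    threshold_ms: float  # latency threshold the run was evaluated against


def success_rate(deployments):
    """Fraction of deployments whose latency met the threshold (assumed definition)."""
    if not deployments:
        raise ValueError("no deployments to evaluate")
    met = sum(1 for d in deployments if d.latency_ms <= d.threshold_ms)
    return met / len(deployments)


# Made-up illustrative runs, not measurements from the study.
runs = [
    Deployment("GPT-2", 42.0, 50.0),
    Deployment("GPT-2", 55.0, 50.0),   # misses its threshold
    Deployment("YOLOv5", 18.0, 25.0),
    Deployment("YOLOv5", 24.0, 25.0),
]
rate = success_rate(runs)
```

With these made-up runs, three of four deployments meet their threshold, so `rate` is 0.75.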

Abstract

The rapid growth of AI-driven applications in hybrid cloud–edge environments poses substantial challenges to ensuring low latency, high throughput, and effective resource utilization. Conventional deployment models, which are typically fixed or policy-driven, are not flexible enough to respond dynamically to changing workloads and heterogeneous hardware environments. In this work, we introduce and analyze a resource-conscious deep learning-based scheduling system for managing the deployment of AI models on distributed cloud edges. The framework improves inference performance by leveraging real-time system telemetry and benchmark-derived model features, while maintaining quality of service (QoS) compliance. The proposed system uses a fully connected neural network trained on structured features derived from the MLPerf Inference Benchmark, including compute complexity, memory footprint, and input dimensions. Its scheduling decisions are informed by real-time data from a hybrid infrastructure (NVIDIA A100/V100 GPUs and Jetson Xavier edge devices). Four MLPerf inference workloads (ResNet-50, BERT, SSD-ResNet34, and DLRM) were tested and compared across various batch sizes and latency thresholds. Generalization experiments with unseen models, including GPT-2 and YOLOv5, yielded deployment success rates above 90% across the profiling conditions evaluated, together with latency reductions and throughput improvements.
The results show that learning-based orchestration can deliver adaptive, space- and resource-aware solutions for low-latency deployment of AI services in hybrid cloud–edge systems, but its effectiveness depends on the representativeness of the profiling data and the similarity between training and deployment environments.
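To illustrate the kind of scheduler the abstract describes, here is a minimal sketch of a fully connected network that maps model features plus telemetry to a placement target. It is not the authors' implementation. The layer sizes, the random placeholder weights, the feature normalization, and the meaning of the telemetry values are all assumptions made for illustration. Only the hardware names and the three feature types come from the abstract.

```python
import math
import random

TARGETS = ["A100", "V100", "Jetson Xavier"]  # hardware named in the abstract


def dense(x, weights, biases):
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random placeholder weights; a real system would load trained ones."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def schedule(model_features, telemetry, layers):
    """Return (target, probabilities) for one deployment request."""
    x = model_features + telemetry
    for w, b in layers[:-1]:
        x = relu(dense(x, w, b))
    w, b = layers[-1]
    probs = softmax(dense(x, w, b))
    best = max(range(len(TARGETS)), key=lambda i: probs[i])
    return TARGETS[best], probs


rng = random.Random(0)
# Assumed shape: 3 model features + 3 telemetry values -> 8 hidden units -> 3 targets
layers = [init_layer(rng, 6, 8), init_layer(rng, 8, 3)]

# Hypothetical, pre-normalized inputs (not values from the paper).
model_features = [0.4, 0.3, 0.5]  # compute complexity, memory footprint, input size
telemetry = [0.7, 0.2, 0.9]       # assumed: current utilization of each target

target, probs = schedule(model_features, telemetry, layers)
```

Because the weights here are untrained, the chosen target carries no meaning. The sketch only shows how benchmark-derived features and live telemetry could be concatenated and scored per target.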

Cite This Study

Ahmed Albugmi studied this question.

synapsesocial.com/papers/6a3780b224f042ddf4c5ac26 https://doi.org/10.1038/s41598-026-58531-w
