What question did this study set out to answer?

This work aims to develop a scheduler that optimizes RAG and large-language-model inference across multiple clouds while ensuring privacy.

May 20, 2026 | Open Access

Cross-Cloud LLMOps Scheduler for Privacy-Budgeted RAG and Inference

Key Points

Proposes the Cross-Cloud LLMOps Scheduler (CCLS), an architecture for routing extraction, retrieval, context construction, and inference.
Applies policy-verified execution to LLM workloads under explicit privacy accounting.
Benchmarks the scheduler in simulation over compliance Q&A, operational summarization, and governed decision-support traffic, with a focus on latency and evidence freshness.
In simulation, reduces P95 request latency by 40.5% relative to static approved-cloud routing.
In simulation, improves evidence-supported answer recall from 82.4% to 91.8%.
In simulation, eliminates privacy-budget overruns and unauthorized policy violations.

Abstract

Enterprise retrieval-augmented generation (RAG) and large-language-model inference increasingly run across multiple cloud providers, vector stores, data catalogs, and model endpoints. The operational scheduler must therefore optimize latency and cost without violating data residency, gateway intent policies, model-context-protocol contracts, or privacy budgets. This paper proposes a Cross-Cloud LLMOps Scheduler (CCLS), a synthetic architecture for routing extraction, retrieval, context construction, and inference under explicit privacy accounting and evidence-maintenance constraints. CCLS extends Policy-Verified Agentic DataOps for Regulated Multi-Cloud Analytics by applying policy-verified execution to LLM workloads, and it extends Retrieval-Grounded Documentation Agents for Enterprise Compliance Evidence by treating compliance evidence freshness as a first-class scheduling signal. The scheduler combines governed API intents, cross-cloud workload placement, distributed RAG, MCP tool contracts, anonymized evidence views, and latency-aware sequence models for context selection. We define the architecture, a multi-objective scheduling model, and a simulated benchmark over compliance Q&A, operational summarization, and governed decision-support traffic. In simulation, CCLS reduces P95 request latency by 40.5% relative to static approved-cloud routing, improves evidence-supported answer recall from 82.4% to 91.8%, and eliminates privacy-budget overruns and unauthorized policy violations.
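The abstract describes the scheduling model only at a high level, so the following is a minimal illustrative sketch rather than the paper's method. It assumes that data residency, evidence freshness, and remaining privacy budget act as hard constraints, and that latency, cost, and evidence staleness are combined into a single weighted score. All names (Endpoint, Request, BudgetLedger, schedule), the weights, and the per-route epsilon charge are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A candidate route (hypothetical fields, not taken from the paper)."""
    name: str
    region: str
    latency_ms: float      # estimated latency for this request
    cost_usd: float        # estimated cost for this request
    epsilon: float         # privacy budget this route would consume
    evidence_age_h: float  # age in hours of the evidence index served here


@dataclass(frozen=True)
class Request:
    tenant: str
    allowed_regions: frozenset  # regions permitted by residency policy
    max_evidence_age_h: float   # freshness requirement for evidence


class BudgetLedger:
    """Tracks the remaining privacy budget per tenant."""

    def __init__(self, limits):
        self.remaining = dict(limits)

    def can_spend(self, tenant, epsilon):
        return self.remaining.get(tenant, 0.0) >= epsilon

    def spend(self, tenant, epsilon):
        if not self.can_spend(tenant, epsilon):
            raise ValueError("privacy budget would be exceeded")
        self.remaining[tenant] -= epsilon


def schedule(request, endpoints, ledger,
             w_latency=1.0, w_cost=100.0, w_stale=0.5):
    """Pick the lowest-scoring endpoint that satisfies every hard constraint.

    Returns None when no endpoint is feasible, so the caller can refuse or
    queue the request instead of overrunning the budget.
    """
    feasible = [
        e for e in endpoints
        if e.region in request.allowed_regions
        and e.evidence_age_h <= request.max_evidence_age_h
        and ledger.can_spend(request.tenant, e.epsilon)
    ]
    if not feasible:
        return None

    def score(e):
        # Weighted sum of the soft objectives; the weights are assumptions.
        return (w_latency * e.latency_ms
                + w_cost * e.cost_usd
                + w_stale * e.evidence_age_h)

    best = min(feasible, key=score)
    ledger.spend(request.tenant, best.epsilon)  # charge only the chosen route
    return best
```

In this sketch the budget is charged only after a route is selected, and an infeasible request returns None rather than falling back to a non-compliant endpoint. That is one simple way to make budget overruns impossible by construction; the paper's scheduler may enforce its constraints differently.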

Cite This Study

Pasupuleti et al. studied this question.

synapsesocial.com/papers/6a0d4fbff03e14405aa9b2e3 https://doi.org/10.5281/zenodo.20265856
