What question did this study set out to answer?

To propose a fundamentally reimagined AI-native operating system optimized for orchestrated multi-model inference at the edge.

June 30, 2026 · Open Access

Designing the AI-Native Operating System: An Architecture for Orchestrated Multi-Model Inference at the Edge

Key Points

Proposed a fundamentally reimagined AI-native operating system optimized for orchestrated multi-model inference at the edge.
Developed SlyOS, a layered reference architecture comprising an agentic application layer, a cognitive scheduler and orchestration plane, an inference runtime, and heterogeneous edge accelerators.
Formalized inference scheduling as continuous multi-objective optimization under latency, energy, and integrity constraints.
Established memory-management subsystems for long-lived agentic sessions, with cross-session persistence and tiered placement.
Identified the key-value cache as a scheduling bottleneck, necessitating enhanced memory management.
Showcased continuous runtime attestation for trust verification on untrusted edge nodes, going beyond boot-time validation.
Provided comprehensive system stack diagrams and scheduling-flow visualizations supporting the architectural proposal.

Abstract

The classical operating system abstraction, built around deterministic instruction execution and thread scheduling, poorly fits modern foundation models, whose workloads are probabilistic, parameter-heavy, and memory-bandwidth-bound. This paper proposes a fundamentally reimagined AI-native operating system whose primitive scheduling unit is an inference request routed across a heterogeneous federation of language and perception models, rather than a thread bound to a single core. The emergence of custom inference silicon (such as OpenAI's Jalapeño processor) and standardized agent-coordination protocols makes this architectural shift both possible and necessary. We present SlyOS, a layered reference architecture comprising an agentic application layer, a cognitive scheduler and orchestration plane, an inference runtime, and heterogeneous edge accelerators with optional cloud burst capacity. The core contribution is a formalization of inference scheduling as continuous multi-objective optimization under latency, energy, and integrity constraints. A placement model with provable convergence operates under intermittent connectivity, treating model selection and fallback as deterministic optimization problems rather than heuristics. At the runtime layer, we establish the key-value cache as the true scheduling bottleneck and develop memory-management subsystems for long-lived agentic sessions with cross-session persistence and tiered placement. A critical design principle is continuous runtime attestation for heterogeneous inference pipelines, recognizing that untrusted edge nodes require trust verification beyond boot-time validation. We distinguish genuinely orchestrated inference from cosmetic model chaining, provide the full system stack including fault-tolerance schemes for offline-first operation, and identify open problems in the memory hierarchy, thermal envelopes, and comparative positioning against existing infrastructure. The architecture is presented with reproducible system diagrams, scheduling-flow visualizations, and quantitative comparisons.
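To make the scheduling formulation more concrete, here is a minimal sketch of a single placement decision, written as a constrained weighted-sum selection. It is an illustration under stated assumptions, not the paper's placement model: the `Placement` type, the `choose_placement` function, the field names, the weights, and the numbers are all hypothetical. Integrity is assumed to be a hard constraint (only attested nodes are eligible), and latency and energy are folded into one normalized cost.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Placement:
    """Hypothetical candidate target for one inference request."""
    name: str
    latency_ms: float       # estimated end-to-end latency
    energy_mj: float        # estimated energy per request
    attested: bool          # passed runtime attestation (assumed hard constraint)
    reachable: bool = True  # False models intermittent connectivity


def choose_placement(
    candidates: Sequence[Placement],
    max_latency_ms: float,
    max_energy_mj: float,
    w_latency: float = 1.0,
    w_energy: float = 1.0,
) -> Optional[Placement]:
    """Return the feasible candidate with the lowest weighted cost, or None."""
    feasible = [
        c for c in candidates
        if c.reachable
        and c.attested
        and c.latency_ms <= max_latency_ms
        and c.energy_mj <= max_energy_mj
    ]
    if not feasible:
        return None  # caller decides how to fall back (e.g. queue or degrade)

    def cost(c: Placement) -> float:
        # Normalize each objective by its budget so the weights are comparable.
        return (w_latency * c.latency_ms / max_latency_ms
                + w_energy * c.energy_mj / max_energy_mj)

    # Ties are broken by name so the choice is deterministic.
    return min(feasible, key=lambda c: (cost(c), c.name))


candidates = [
    Placement("npu-small", latency_ms=40.0, energy_mj=5.0, attested=True),
    Placement("gpu-large", latency_ms=25.0, energy_mj=30.0, attested=True),
    Placement("cloud-burst", latency_ms=15.0, energy_mj=2.0, attested=True,
              reachable=False),
    Placement("unverified-node", latency_ms=10.0, energy_mj=1.0, attested=False),
]

balanced = choose_placement(candidates, max_latency_ms=50.0, max_energy_mj=40.0)
latency_first = choose_placement(candidates, max_latency_ms=50.0,
                                 max_energy_mj=40.0, w_energy=0.1)
too_strict = choose_placement(candidates, max_latency_ms=5.0, max_energy_mj=40.0)
```

With equal weights the sketch selects `npu-small`; lowering the energy weight shifts the choice to `gpu-large`; an unreachable or unattested node is never chosen, however attractive its latency. A single scalarized choice like this says nothing about convergence or continuous re-optimization, which the abstract attributes to the full placement model.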
