What question did this study set out to answer?

The aim is to enhance the efficiency of multi-tenant serving infrastructures for large language models by addressing performance bottlenecks during autoregressive decoding.

April 3, 2026 · Open Access

DisaggKV: Scalable and Disaggregated CXL-PIM Pooling for Multi-Tenant LLM Serving

Key Points

Proposed DisaggKV, a framework that integrates near-data logic with CXL 3.0 Peer-to-Peer (P2P) fabric capabilities
Designed a Hypervisor-level Disaggregated OS Scheduler with Locality-Aware Page Tables
Developed an asynchronous distributed synchronization barrier for Global Softmax normalization
Evaluated using a modified CXL-extended Ramulator 2.0 framework with ShareGPT and Qwen2.5 traces
Achieved over 92% reduction in global switch traffic
Scaled near-linearly under a strict 50 ms tail-latency limit across 8 nodes
Significantly exceeded throughput of classical Host-Aggregated pools
Achieved fault resiliency within hundreds of milliseconds via a P2P error correction protocol, compared to multi-second software checkpoint restarts

Abstract

The explosive demand for Large Language Models (LLMs) has pushed multi-tenant serving infrastructures to their physical limits. Unbounded sequence lengths and heavy concurrent batching generate immense Key-Value (KV) cache footprints that rapidly exhaust local GPU High-Bandwidth Memory (HBM). While Compute Express Link (CXL) enables seamless cache-coherent physical memory pooling across racks, accessing disaggregated standard CXL memory arrays during the autoregressive decode phase imposes significant performance degradation. Repeatedly fetching dense historical KV vectors across the Host-indirected CXL fabric to resolve Attention dot-products saturates PCIe bandwidth, violating strict 99% tail-latency Service-Level Agreements (SLAs). In this paper, we propose DisaggKV, a scalable processing-in-disaggregated-memory framework that fundamentally rearchitects multi-tenant LLM serving interconnects. By integrating near-data logic with CXL 3.0 Peer-to-Peer (P2P) fabric capabilities, DisaggKV encapsulates Attention reduction operations entirely within the remote endpoints. To orchestrate this, we design a Hypervisor-level Disaggregated OS Scheduler featuring Locality-Aware Page Tables that inherently separate shared "hot" prompts from independent private contexts across clustered CXL nodes. To facilitate hardware scalability, we propose an asynchronous distributed synchronization barrier that computes Global Softmax normalization autonomously across the fabric without traversing the Host I/O switch, thereby preventing port congestion and deadlocks. Evaluated on a heavily modified, CXL-extended Ramulator 2.0 framework driven by realistic ShareGPT-derived and Qwen2.5-7B-Instruct multi-tenant traces, DisaggKV demonstrates substantial improvements. Event-driven queuing simulation on CXL topologies reveals over 92% reduction in global switch traffic. Under a strict 50 ms tail-latency limit, DisaggKV scales near-linearly across 8 independent nodes, significantly exceeding the throughput of classical Host-Aggregated pools. Furthermore, its P2P error correction protocol achieves fault resiliency within sub-second timelines (hundreds of milliseconds), providing strong robustness for large-scale clustered deployments compared to multi-second software checkpoint restarts.
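
The abstract does not spell out how Global Softmax normalization is combined across nodes. As a rough illustration only, the sketch below uses the standard max-rescaled ("online softmax") merge, which is a common way to make softmax attention exact when the KV cache is sharded; it is an assumption here, not the paper's confirmed mechanism. Each node reduces its local KV shard to a small partial result (local maximum score, local exponential sum, unnormalized weighted value sum), and partials can be merged in any arrival order. All function names (`local_partial`, `merge`, `finalize`) are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math
from functools import reduce
from typing import List, Sequence, Tuple

# Hypothetical partial result a node would emit for one query:
# (local max score, local sum of exp(score - max), unnormalized weighted value sum)
Partial = Tuple[float, float, List[float]]


def local_partial(q: Sequence[float],
                  keys: Sequence[Sequence[float]],
                  values: Sequence[Sequence[float]]) -> Partial:
    """Reduce one node's local KV shard to a compact partial attention result."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # local max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, acc


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partials; associative and commutative, so order does not matter."""
    m = max(a[0], b[0])
    scale_a = math.exp(a[0] - m)                     # rescale each side to the shared max
    scale_b = math.exp(b[0] - m)
    denom = a[1] * scale_a + b[1] * scale_b
    acc = [x * scale_a + y * scale_b for x, y in zip(a[2], b[2])]
    return m, denom, acc


def finalize(p: Partial) -> List[float]:
    """Apply the global normalization once all partials have been merged."""
    return [x / p[1] for x in p[2]]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
    values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 4.0]]
    # Two illustrative "nodes", each holding half of the KV entries.
    parts = [local_partial(q, keys[:2], values[:2]),
             local_partial(q, keys[2:], values[2:])]
    print(finalize(reduce(merge, parts)))
```

Because `merge` is associative and commutative, only the fixed-size partials need to travel between endpoints rather than the KV vectors themselves, and a reduction of this shape can in principle proceed as results arrive instead of waiting on a single central aggregator.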
