The explosive demand for Large Language Models (LLMs) has pushed multi-tenant serving infrastructures to their physical limits. Unbounded sequence lengths and heavy concurrent batching generate immense Key-Value (KV) cache footprints that rapidly exhaust local GPU High-Bandwidth Memory (HBM). While Compute Express Link (CXL) enables cache-coherent physical memory pooling across racks, accessing standard disaggregated CXL memory arrays during the autoregressive decode phase significantly degrades performance. Repeatedly fetching dense historical KV vectors across the Host-indirected CXL fabric to resolve Attention dot-products saturates PCIe bandwidth and violates strict 99th-percentile tail-latency Service-Level Agreements (SLAs). In this paper, we propose DisaggKV, a scalable processing-in-disaggregated-memory framework that rearchitects multi-tenant LLM serving interconnects. By integrating near-data logic with CXL 3.0 Peer-to-Peer (P2P) fabric capabilities, DisaggKV encapsulates Attention reduction operations entirely within the remote endpoints. To orchestrate this, we design a Hypervisor-level Disaggregated OS Scheduler featuring Locality-Aware Page Tables that separate shared ``hot'' prompts from independent private contexts across clustered CXL nodes. To facilitate hardware scalability, we propose an asynchronous distributed synchronization barrier that computes Global Softmax normalization autonomously across the fabric without traversing the Host I/O switch, thereby preventing port congestion and deadlocks. We evaluate DisaggKV on a heavily modified, CXL-extended Ramulator 2.0 framework driven by realistic ShareGPT-derived and Qwen2.5-7B-Instruct multi-tenant traces. Event-driven queuing simulation on CXL topologies shows a reduction of over 92\% in global switch traffic.
Under a strict 50\,ms tail-latency limit, DisaggKV scales near-linearly across 8 independent nodes, significantly exceeding the throughput of classical Host-Aggregated pools. Furthermore, its P2P error correction protocol recovers from faults within hundreds of milliseconds, whereas software checkpoint restarts take multiple seconds, which makes it robust for large-scale clustered deployments.
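The abstract does not spell out how Global Softmax normalization is combined across nodes. As an illustration only, the sketch below uses the standard log-sum-exp merge of per-shard partial results, which may differ from DisaggKV's actual protocol. Each node reduces its local KV shard to a triple (local maximum score, local exponential sum, unnormalized weighted value sum), and only these small triples need to be exchanged. The function names `local_attention`, `merge_partials`, and `reference_attention` are hypothetical, and every shard is assumed to be non-empty.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical per-node partial result:
# (local max score, local sum of exp(score - max), unnormalized weighted value sum)
Partial = Tuple[float, float, List[float]]


def local_attention(q: Sequence[float],
                    keys: Sequence[Sequence[float]],
                    values: Sequence[Sequence[float]]) -> Partial:
    """Reduce one node's (non-empty) KV shard to a small partial result."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # local maximum, kept for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, acc


def merge_partials(partials: Sequence[Partial]) -> List[float]:
    """Combine per-node partials into the globally normalized attention output."""
    g = max(m for m, _, _ in partials)  # global maximum across nodes
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, denom, acc in partials:
        c = math.exp(m - g)  # rescale this node's terms to the global maximum
        total += c * denom
        for d, a in enumerate(acc):
            out[d] += c * a
    return [o / total for o in out]


def reference_attention(q: Sequence[float],
                        keys: Sequence[Sequence[float]],
                        values: Sequence[Sequence[float]]) -> List[float]:
    """Single-node softmax attention over the full KV set, for comparison."""
    _, denom, acc = local_attention(q, keys, values)
    return [a / denom for a in acc]
```

Because the merge is associative and commutative, partials can be combined in any order or pairwise between peers, which is one reason this formulation suits a fabric-level reduction that avoids a central host.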
KAICHEN LI (Tue,) studied this question.