What question did this study set out to answer?

The study aims to relieve memory limitations in GPU-accelerated LLM serving by using CXL technology for efficient KVCache management.

April 10, 2026 (Open Access)

Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

Key Points

Developed Beluga, a memory architecture enabling shared access of large memory pools via CXL switches.
Implemented Beluga-KVCache for managing large-scale KVCache in LLM inference.
Characterized CXL-based memory pools and created design guidelines.
Achieved an 89.6% reduction in Time-To-First-Token (TTFT) in vLLM compared to RDMA-based solutions.
Demonstrated a 7.35x throughput improvement in vLLM compared to RDMA-based solutions.
Enabled direct GPU access to large-scale memory pools, reducing latency and programming complexity.

Abstract

The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated LLM serving. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support large KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of CXL-based memory pools and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing large-scale KVCache for LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and a 7.35x throughput improvement in vLLM compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches (Marvell XConn XC50256), marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
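The two headline figures use different units: TTFT is reported as a percentage reduction, throughput as a multiplicative factor. As a reading aid only, the short sketch below converts a percentage latency reduction into the equivalent speedup factor. The function name `speedup_from_reduction` is hypothetical and not taken from the paper; the only input is the 89.6% figure stated in the abstract.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in latency into a speedup factor.

    A reduction of r% leaves (100 - r)% of the original latency,
    so the speedup is 100 / (100 - r).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


# 89.6% is the TTFT reduction reported in the abstract.
ttft_speedup = speedup_from_reduction(89.6)
print(f"TTFT speedup: {ttft_speedup:.2f}x")  # prints "TTFT speedup: 9.62x"
```

Under this reading, an 89.6% TTFT reduction corresponds to roughly a 9.6x lower time to first token, which is a separate measurement from the 7.35x throughput improvement.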

Authors

Xinjun Yang

Alibaba Group (United States)

Qingda Hu

Alibaba Group (China)

J. L. Li

Alibaba Group (China)

Journals

Proceedings of the ACM on Management of Data

Institutions

Alibaba Group (China)

Alibaba Group (United States)

Cloud Computing Center
