What question did this study set out to answer?

To develop a disaggregated KV-cache architecture for efficient LLM deployment in datacenters.

December 19, 2025 | Open Access

CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving

Key Points

Introduced the CXL-SpecKV architecture, which leverages CXL interconnects and FPGA accelerators.
Implemented a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries.
Developed an FPGA-accelerated KV-cache compression and decompression engine.
Achieved up to 3.2× higher throughput compared to GPU-only baselines.
Reduced memory costs by 2.8× while maintaining accuracy.
Demonstrated that intelligent memory disaggregation combined with speculative execution can address the memory wall challenge in large-scale LLM serving.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing tasks, but their deployment in datacenter environments faces significant challenges due to the massive memory requirements of key-value (KV) caches. During the autoregressive decoding process, KV caches consume substantial GPU memory, limiting batch sizes and overall system throughput. To address these challenges, we propose CXL-SpecKV, a novel disaggregated KV-cache architecture that leverages Compute Express Link (CXL) interconnects and FPGA accelerators to enable efficient speculative execution and memory disaggregation. Our approach introduces three key innovations: (i) a CXL-based memory disaggregation framework that offloads KV-caches to remote FPGA memory with low latency, (ii) a speculative KV-cache prefetching mechanism that predicts and preloads future tokens' cache entries, and (iii) an FPGA-accelerated KV-cache compression and decompression engine that reduces memory bandwidth requirements by up to 4×. When evaluated on state-of-the-art LLM models, CXL-SpecKV achieves up to 3.2× higher throughput compared to GPU-only baselines, while reducing memory costs by 2.8× and maintaining accuracy. Our system demonstrates that intelligent memory disaggregation combined with speculative execution can effectively address the memory wall challenge in large-scale LLM serving. Our code implementation has been open-sourced at https://github.com/FastLM/CXL-SpecKV.
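To make the abstract's point about KV-cache memory pressure concrete, the following is a minimal back-of-the-envelope sketch, not code from the CXL-SpecKV repository. It uses the standard sizing rule for a transformer KV cache (one key and one value vector per layer, head, and token). The model shape (32 layers, 32 KV heads, head dimension 128, FP16), the 4096-token context, and the 40 GiB memory budget are illustrative assumptions, not figures from the paper.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # The factor 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def max_batch_size(budget_bytes, num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_elem=2):
    """Largest batch whose KV cache fits in the given memory budget."""
    per_sequence = kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                                  seq_len, 1, bytes_per_elem)
    return budget_bytes // per_sequence


# Hypothetical model shape and budget, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

per_sequence_gib = kv_cache_bytes(batch_size=1, **cfg) / GIB
print(per_sequence_gib)                  # 2.0 (GiB per 4096-token sequence)
print(max_batch_size(40 * GIB, **cfg))   # 20 sequences fit in 40 GiB
```

Under these assumptions each sequence needs 2 GiB of cache, so the batch size is capped by memory rather than compute. That is the constraint that offloading or compressing the cache is meant to relax.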
