Large Language Models (LLMs) have sparked a new wave of exciting AI applications, yet their large model size imposes significant computational and storage costs during inference. Offloading parameters to the CPU and conducting GPU-CPU collaborative inference is a highly cost-effective strategy to alleviate GPU memory constraints. However, current solutions struggle to balance latency and throughput, and suffer from accuracy loss and performance fluctuations under various workloads and configurations. In this paper, we propose Q-Infer, an efficient GPU-CPU collaborative inference system that significantly improves the performance and quality of LLM inference through several optimizations: 1) Q-Infer designs a dynamic caching strategy for important parameters by exploiting model sparsity and locality. 2) Q-Infer proposes a multi-window-based approach for selecting important tokens, which reduces the KV cache while maintaining high accuracy. 3) Q-Infer develops a novel GPU-CPU collaborative inference and dynamic scheduling strategy to enhance performance across different environments. We evaluate Q-Infer using various models and workloads across different hardware configurations. The results demonstrate Q-Infer’s superior inference performance while retaining model accuracy compared to state-of-the-art GPU-CPU systems.
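To make the idea of window-based token selection concrete, the following is a minimal illustrative sketch, not Q-Infer's actual algorithm, which the abstract does not specify. It assumes that recent attention weights are available as a plain list of rows (one row per recent query step, one column per cached key token), splits those rows into several windows, scores each cached token by its highest mean attention in any window, and keeps the top-scoring tokens within a budget. The function name `select_important_tokens` and the parameters `window`, `num_windows`, and `budget` are hypothetical.

```python
from typing import List


def select_important_tokens(
    attn: List[List[float]],
    window: int,
    num_windows: int,
    budget: int,
) -> List[int]:
    """Return indices of cached tokens to keep (illustrative heuristic only).

    attn[q][k] is the attention weight from recent query step q to cached
    key token k. All names and the scoring rule are assumptions made for
    this sketch; they are not taken from the Q-Infer paper.
    """
    if window <= 0 or num_windows <= 0 or budget < 0:
        raise ValueError("window and num_windows must be > 0, budget >= 0")
    if not attn:
        return []

    num_keys = len(attn[0])
    # Only the most recent window * num_windows query rows are considered.
    recent = attn[-window * num_windows:]

    scores = [0.0] * num_keys
    for start in range(0, len(recent), window):
        rows = recent[start:start + window]
        for k in range(num_keys):
            # Mean attention this key received within the current window.
            mean_k = sum(row[k] for row in rows) / len(rows)
            # A token counts as important if any window attends to it strongly.
            scores[k] = max(scores[k], mean_k)

    # Highest score first; ties are broken in favour of more recent tokens.
    ranked = sorted(range(num_keys), key=lambda k: (-scores[k], -k))
    return sorted(ranked[:budget])
```

Taking the maximum over windows, rather than a single average over all recent steps, keeps a token that was heavily attended in one window even if later windows ignore it. That is one plausible reason to use several windows, offered here as an assumption rather than as the paper's stated rationale.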
Kai Lü
Qiang Wei
Yan Lin
ACM Transactions on Architecture and Code Optimization
Huazhong University of Science and Technology
Wuhan National Laboratory for Optoelectronics
Huawei Technologies (China)
www.synapsesocial.com/papers/68bb3ef72b87ece8dc95797a — DOI: https://doi.org/10.1145/3764589