September 5, 2025 | Open Access

Q-Infer: Towards Efficient GPU-CPU Collaborative LLM Inference via Sparsity-Aware Dynamic Scheduling

Key Points

Q-Infer improves LLM inference performance over state-of-the-art GPU-CPU systems while retaining model accuracy.
The dynamic caching strategy leverages model sparsity to optimize parameter management, leading to improved efficiency.
A multi-window-based approach selects important tokens, effectively reducing the KV cache without sacrificing accuracy.
A novel GPU-CPU collaborative inference and dynamic scheduling strategy adapts to different hardware configurations and workloads, targeting the latency-throughput balance that current solutions struggle with.

Abstract

Large Language Models (LLMs) have sparked a new wave of exciting AI applications, yet their large model size imposes significant computational and storage costs during inference. Offloading parameters to the CPU and conducting GPU-CPU collaborative inference is a highly cost-effective strategy to alleviate GPU memory constraints. However, current solutions struggle to balance latency and throughput, and suffer from accuracy loss and performance fluctuations under various workloads and configurations. In this paper, we propose Q-Infer, an efficient GPU-CPU collaborative inference system that significantly improves the performance and quality of LLM inference through several optimizations: 1) Q-Infer designs a dynamic caching strategy for important parameters by exploiting model sparsity and locality. 2) Q-Infer proposes a multi-window-based approach for selecting important tokens, which reduces the KV cache while maintaining high accuracy. 3) Q-Infer develops a novel GPU-CPU collaborative inference and dynamic scheduling strategy to enhance performance across different environments. We evaluate Q-Infer using various models and workloads across different hardware configurations. The results demonstrate Q-Infer’s superior inference performance while retaining model accuracy compared to state-of-the-art GPU-CPU systems.
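The abstract does not spell out how the multi-window token selection works, so the following is only a minimal sketch of one plausible reading, not the paper's algorithm. It assumes that each key token is scored by the average attention it receives from the most recent queries, that this average is computed over several window lengths, and that the per-window averages are summed before keeping the top-scoring tokens. The function name `select_important_tokens` and the parameters `attn`, `window_sizes`, and `budget` are hypothetical.

```python
from typing import List, Sequence


def select_important_tokens(
    attn: Sequence[Sequence[float]],
    window_sizes: Sequence[int],
    budget: int,
) -> List[int]:
    """Return sorted indices of the key tokens to keep in the KV cache.

    attn[q][k] is the attention weight that recent query q (oldest first)
    assigns to key token k. Assumes at least one query row and positive
    window sizes. Illustrative only; not taken from the Q-Infer paper.
    """
    num_queries = len(attn)
    num_keys = len(attn[0])
    combined = [0.0] * num_keys

    for w in window_sizes:
        w = min(w, num_queries)
        window = attn[num_queries - w:]  # the w most recent queries
        for k in range(num_keys):
            # Mean attention that this window assigns to key k.
            combined[k] += sum(row[k] for row in window) / w

    # Rank keys by combined score; ties go to the more recent token.
    ranked = sorted(range(num_keys), key=lambda k: (combined[k], k), reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    attn = [
        [0.7, 0.1, 0.1, 0.1],  # oldest query
        [0.1, 0.1, 0.1, 0.7],
        [0.1, 0.6, 0.1, 0.2],  # newest query
    ]
    print(select_important_tokens(attn, window_sizes=[1, 3], budget=2))
```

In this sketch a short window favors tokens that matter to the latest query, while a longer window keeps tokens that were consistently attended to. Summing the two is one simple way to combine them; the paper may well use a different rule.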
