Retrieval-Augmented Generation (RAG) has demonstrated substantial advances in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and an extra data dependency through its retrieval step, resulting in long end-to-end latency. Our analysis benchmarks current RAG systems and reveals that, while the retrieval step poses performance challenges, it also offers optimization opportunities through its retrieval pattern and streaming search behavior. We propose RAGCache, a latency-optimized serving system tailored for RAG. RAGCache leverages the retrieval pattern to organize and cache the intermediate states of retrieved knowledge in a knowledge tree across the GPU and host memory hierarchy, reducing LLM generation time. RAGCache also employs dynamic speculative pipelining to exploit the streaming search behavior, overlapping retrieval with LLM generation to minimize end-to-end latency. We implement RAGCache based on vLLM and Faiss, and evaluate it on both open-source and production datasets. Experimental results demonstrate that RAGCache reduces the time to first token (TTFT) by up to 4× and improves throughput by up to 2.1× compared to vLLM integrated with Faiss.
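As a rough illustration of the knowledge-tree idea described above, the following minimal Python sketch stores cached states in a prefix tree keyed by the ordered sequence of retrieved document IDs. It is not RAGCache's implementation: the names `Node`, `KnowledgeTree`, `insert`, and `match_prefix` are hypothetical, the `state` field is a placeholder for whatever intermediate state the system caches, and the `tier` label only stands in for the GPU/host placement mentioned in the abstract. The sketch assumes that a document's cached state is reusable only when the same documents precede it in the same order.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """One retrieved document at a given position in a document sequence."""
    doc_id: str
    state: Any                      # placeholder for the cached intermediate state
    tier: str = "gpu"               # hypothetical placement label: "gpu" or "host"
    children: Dict[str, "Node"] = field(default_factory=dict)


class KnowledgeTree:
    """Prefix tree over ordered sequences of retrieved document IDs (illustrative)."""

    def __init__(self) -> None:
        self.root = Node(doc_id="<root>", state=None)

    def insert(self, doc_ids: List[str], states: List[Any], tier: str = "gpu") -> None:
        """Record the state computed for each document along this ordered path."""
        node = self.root
        for doc_id, state in zip(doc_ids, states):
            if doc_id not in node.children:
                node.children[doc_id] = Node(doc_id, state, tier)
            node = node.children[doc_id]

    def match_prefix(self, doc_ids: List[str]) -> List[Node]:
        """Return the cached nodes covering the longest matching prefix."""
        node, hits = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break               # first miss: remaining documents need recomputation
            hits.append(child)
            node = child
        return hits


# Usage: a later request sharing the prefix ["d1", "d2"] can reuse two cached states.
tree = KnowledgeTree()
tree.insert(["d1", "d2"], ["state(d1)", "state(d1,d2)"])
reused = tree.match_prefix(["d1", "d2", "d3"])
```

In this sketch a request that retrieves the same documents in a different order finds no reusable prefix, which is why the tree is keyed on ordered paths rather than on individual documents.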
Chao Jin
Zili Zhang
Xuanlin Jiang
ACM Transactions on Computer Systems
Peking University
Jin et al. studied this question.
www.synapsesocial.com/papers/68d46fcd31b076d99fa69ff3 — DOI: https://doi.org/10.1145/3768628