What type of study is this?

This is an experimental study.

September 20, 2025

Open Access

RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Key Points

RAGCache reduces LLM generation time by caching the intermediate states of retrieved knowledge in a knowledge tree that spans GPU and host memory.
Experimental results show that RAGCache reduces time to first token (TTFT) by up to 4× and improves throughput by up to 2.1× compared to vLLM integrated with Faiss.
It uses dynamic speculative pipelining to overlap retrieval with LLM generation, minimizing end-to-end latency.
Benchmarks of current RAG systems show that the retrieval pattern and streaming search behavior offer optimization opportunities.
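
To make the idea of overlapping retrieval with generation concrete, here is a minimal Python sketch. It is not the paper's implementation: the function name `speculative_pipeline`, the `stages` input, and the `generate` callback are all hypothetical. It assumes the vector search emits a provisional top-k list at each stage, with the last list being final. Generation starts on each provisional list and is redone only when the list changes. A real system would run generation concurrently with the search and cancel stale work; this sketch runs sequentially and does not model the "dynamic" decision of when speculation is worthwhile.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def speculative_pipeline(
    stages: Iterable[List[str]],
    generate: Callable[[List[str]], str],
) -> Tuple[Optional[str], int]:
    """Start generation on provisional retrieval results (illustrative only).

    stages:   provisional top-k document-ID lists, in the order the search
              emits them; the last list is treated as the final result.
    generate: hypothetical stand-in for LLM generation over a document list.
    Returns the output for the final list and how many generations started.
    """
    current_docs: Optional[List[str]] = None
    output: Optional[str] = None
    started = 0
    for docs in stages:
        if docs != current_docs:
            # The candidate list changed, so earlier speculative work is
            # stale and generation restarts on the new list.
            current_docs = docs
            output = generate(docs)
            started += 1
        # If the list is unchanged, the speculative result is kept as is.
    return output, started


if __name__ == "__main__":
    def fake_generate(docs: List[str]) -> str:
        return "answer:" + ",".join(docs)

    # The search refines its result once, then confirms it.
    result, runs = speculative_pipeline(
        [["d1", "d3"], ["d1", "d2"], ["d1", "d2"]], fake_generate
    )
    print(result, runs)
```

When the final stage only confirms the previous candidate list, the output is already available at the moment retrieval finishes, which is where the latency saving in this sketch comes from.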

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated substantial advancements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, the retrieval step introduces long sequence generation and extra data dependency, resulting in long end-to-end latency. Our analysis benchmarks current RAG systems and reveals that, while the retrieval step poses performance challenges, it also offers optimization opportunities through its retrieval pattern and streaming search behavior. We propose RAGCache, a latency-optimized serving system tailored for RAG. RAGCache leverages the retrieval pattern to organize and cache the intermediate states of retrieved knowledge in a knowledge tree across the GPU and host memory hierarchy, reducing LLM generation time. RAGCache employs dynamic speculative pipelining to exploit the streaming search behavior, overlapping retrieval with LLM generation to minimize end-to-end latency. We implement RAGCache based on vLLM and Faiss, and evaluate it on both open-source and production datasets. Experimental results demonstrate that RAGCache reduces the time to first token (TTFT) by up to 4× and improves the throughput by up to 2.1× compared to vLLM integrated with Faiss.
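
The knowledge tree mentioned in the abstract can be pictured as a prefix tree over sequences of retrieved documents. The Python sketch below is an illustration under stated assumptions, not the paper's data structure: the class names `Node` and `KnowledgeTree`, the `tier` labels, and the string placeholders for cached states are all hypothetical. It assumes that the cached state of a document depends on the documents placed before it, so a lookup can reuse only the longest cached prefix of the requested sequence. Replacement and movement between GPU and host memory are not modeled beyond a tier label.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One document at a given position in a retrieved-document sequence."""
    doc_id: str
    kv_state: Optional[object] = None  # placeholder for cached intermediate state
    tier: str = "gpu"                  # hypothetical label: "gpu" or "host"
    children: Dict[str, "Node"] = field(default_factory=dict)


class KnowledgeTree:
    """Prefix tree keyed by ordered document IDs (illustrative sketch)."""

    def __init__(self) -> None:
        self.root = Node(doc_id="<root>")

    def longest_cached_prefix(self, doc_ids: List[str]) -> Tuple[int, List[Node]]:
        """Return how many leading documents are cached, plus their nodes."""
        node, path = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.kv_state is None:
                break  # the rest of the sequence must be computed from scratch
            path.append(child)
            node = child
        return len(path), path

    def insert(self, doc_ids: List[str], kv_states: List[object],
               tier: str = "gpu") -> None:
        """Store states along the path; existing cached states are kept."""
        node = self.root
        for doc_id, kv in zip(doc_ids, kv_states):
            child = node.children.get(doc_id)
            if child is None:
                child = Node(doc_id=doc_id)
                node.children[doc_id] = child
            if child.kv_state is None:
                child.kv_state = kv
                child.tier = tier
            node = child


if __name__ == "__main__":
    tree = KnowledgeTree()
    tree.insert(["d1", "d2"], ["kv(d1)", "kv(d2|d1)"])
    hits, _ = tree.longest_cached_prefix(["d1", "d2", "d3"])
    print(hits)  # two leading documents are reusable; d3 still needs computing
    hits, _ = tree.longest_cached_prefix(["d2", "d1"])
    print(hits)  # different order, so nothing is reusable in this sketch
```

Keying the tree by document order rather than by a set of documents reflects the assumption above that a cached state is only valid after the same preceding documents.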

Cite This Study

Jin et al. (2025) studied this question.

synapsesocial.com/papers/68d46fcd31b076d99fa69ff3 https://doi.org/10.1145/3768628
