Large language models (LLMs) are powering a new wave of language-based applications, including database applications. Their enormous compute and memory needs, coupled with advances in computing hardware, have led to new techniques and systems for serving them. In this tutorial, we review how these techniques lower inference costs by managing uncertain request lifecycles, exploiting specialized hardware, and scaling over distributed inference devices and machines. We present these techniques from the database perspective of request processing, model execution and optimization, and memory management. Following this discussion, we review how inference systems combine these techniques in diverse architectures to achieve application or performance objectives.
James Pan
Guoliang Li
Proceedings of the VLDB Endowment
Tsinghua University
Pan et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68d46ccf31b076d99fa69120 — DOI: https://doi.org/10.14778/3750601.3750703