Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing. However, their massive computational and memory requirements often necessitate cloud-based deployment, which introduces challenges related to cost, latency, privacy, and network reliability. Deploying LLMs on-device alleviates these challenges but is hindered by the severe resource constraints of edge hardware. This survey reviews efficient inference techniques for edge LLMs, focusing on two key strategies: speculative decoding and model offloading. We categorize these strategies into single-device and multi-device types and systematically analyze their principles, recent advancements, implementations, and support within edge frameworks. Finally, we highlight open challenges and future research directions that could advance edge LLM inference.
Cai et al. studied this question.