What type of study is this?

This is a Quantitative Study study.

October 20, 2025Open Access

PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Key Points

Latency is significantly reduced by around 2x using the PredGen framework in real-time voice interactions.
The approach utilizes speculative decoding during user input, allowing for faster generation of audio responses.
Simulations on Lmsys and MT-Bench datasets confirm effective performance improvements with minimal added computation.
This method addresses latency issues commonly faced with LLMs on consumer hardware, enhancing user experience.

Abstract

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused.

PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Key Points

Abstract

Cite This Study

Also Consider

Also Consider