Large Language Models (LLMs) are widely used in continuous, multi-turn conversational settings. As a conversation progresses, the accumulated dialogue history lengthens the context, and response latency rises because prior tokens are reprocessed at every inference step. This work presents an enhanced Memory-Block Protocol (MBP) for latency optimization in long-context GPT-based dialogues. The approach segments a conversation into structured memory blocks, generates compact semantic summaries, extracts persistent state tokens, and maintains a task ledger to preserve workflow continuity. A controlled context refresh mechanism then replaces redundant historical context with this compressed representation while retaining essential semantic information. Theoretical analysis indicates that the method replaces the quadratic latency growth associated with self-attention over an ever-growing context with per-turn latency that is effectively constant once the compressed context stabilizes. Experimental evaluation across multiple task domains (reasoning, programming, algorithmic analysis, and summarization) shows that MBP significantly reduces response latency while maintaining high semantic coherence. The framework is model-agnostic and operates entirely at the prompt level, requiring no modification to the underlying architecture, which makes it suitable for practical deployment in long-running conversational systems where efficiency and continuity are critical.
Mayank et al. (Thu,) studied this question.
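The pipeline described above (memory blocks, a compact summary, persistent state tokens, a task ledger, and a context refresh) can be illustrated with a minimal prompt-level sketch. This is not the authors' implementation: the class `MemoryBlockContext`, the helper `prompt_lengths`, the section labels, and the block size are hypothetical names chosen for illustration, and `naive_summary` is a stand-in that simply keeps the most recent words where a real system would presumably call an LLM summarizer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def naive_summary(texts: List[str], max_words: int = 40) -> str:
    """Placeholder summarizer (assumption): keep only the last max_words words."""
    words = " ".join(texts).split()
    return " ".join(words[-max_words:])


@dataclass
class MemoryBlockContext:
    """Hypothetical prompt-level context manager in the spirit of MBP."""
    block_size: int = 4                                   # turns per memory block
    summarize: Callable[[List[str]], str] = naive_summary
    summary: str = ""                                     # rolling compact summary
    state: Dict[str, str] = field(default_factory=dict)   # persistent state tokens
    ledger: List[str] = field(default_factory=list)       # open tasks
    live_turns: List[str] = field(default_factory=list)   # uncompressed turns

    def add_turn(self, text: str) -> None:
        """Record a turn; close the block and refresh once it is full."""
        self.live_turns.append(text)
        if len(self.live_turns) >= self.block_size:
            self.refresh()

    def refresh(self) -> None:
        """Context refresh: fold the live turns into the bounded summary."""
        prior = [self.summary] if self.summary else []
        self.summary = self.summarize(prior + self.live_turns)
        self.live_turns = []

    def build_prompt(self, user_message: str) -> str:
        """Assemble the compressed context that is sent instead of full history."""
        state = "; ".join(f"{k}={v}" for k, v in sorted(self.state.items()))
        sections = [
            "SUMMARY: " + self.summary,
            "STATE: " + state,
            "TASKS: " + "; ".join(self.ledger),
            "RECENT:\n" + "\n".join(self.live_turns),
            "USER: " + user_message,
        ]
        return "\n".join(sections)


def prompt_lengths(n_turns: int, block_size: int = 4) -> List[int]:
    """Prompt size in words after each turn of a synthetic conversation."""
    ctx = MemoryBlockContext(block_size=block_size)
    ctx.state["language"] = "Python"
    ctx.ledger.append("finish the parser")
    lengths = []
    for i in range(n_turns):
        ctx.add_turn(f"turn {i}: some dialogue text about the parser design")
        lengths.append(len(ctx.build_prompt("next question").split()))
    return lengths
```

In this sketch the prompt size stops growing once the rolling summary reaches its word cap, so `prompt_lengths(60)` settles into a short repeating pattern rather than increasing with the turn count. That bounded context is what the constant per-turn latency argument relies on; how much semantic information survives depends entirely on the quality of the summarizer, which the placeholder here does not attempt to model.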