What question did this study set out to answer?

This research aims to improve conversational memory management for large language models by reducing token use without losing important information.

March 17, 2026 · Open Access

Token-Efficient Conversational History Management using Sparse Semantic Patch Memory

Key Points

Updated the Sparse Semantic Patch Memory (SSPM) framework with a full architectural design.
Used Gemini 2.5 Flash for semantic extraction and experimental validation.
Implemented a modular pipeline of five stages: semantic extraction, utility scoring, sparse selection, indexed memory storage, and prompt composition.
Achieved a 48.7% average token reduction across five multi-turn technical dialogues.
Preserved 100% of explicit constraints and decisions in the evaluated dialogues.
Significantly outperformed the raw-history, truncation, and compression baselines.
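
The last two pipeline stages listed above, indexed memory and prompt composition, can be sketched as follows. This is a minimal illustration, not the study's implementation: the `Patch` fields, the `PatchMemory` class, the choice to index by patch kind, and the bracketed prompt layout are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Patch:
    """One semantic unit extracted from a turn (hypothetical schema)."""
    patch_id: str
    kind: str   # e.g. "constraint", "decision", "entity", "code"
    text: str
    turn: int   # index of the turn the patch came from


@dataclass
class PatchMemory:
    """Sparse memory indexed by patch kind (illustrative only)."""
    by_kind: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, patch: Patch) -> None:
        self.by_kind[patch.kind].append(patch)

    def compose_prompt(self, query: str) -> str:
        """Render stored patches, grouped by kind, ahead of the current query."""
        lines = []
        for kind in sorted(self.by_kind):
            lines.append(f"[{kind}]")
            # Within a kind, keep patches in conversational order.
            for p in sorted(self.by_kind[kind], key=lambda p: p.turn):
                lines.append(f"- (turn {p.turn}) {p.text}")
        lines.append("[query]")
        lines.append(query)
        return "\n".join(lines)


memory = PatchMemory()
memory.add(Patch("p1", "constraint", "Use Python 3.11", turn=1))
memory.add(Patch("p2", "decision", "Store data in SQLite", turn=2))
prompt = memory.compose_prompt("Write the schema.")
print(prompt)
```

The composed prompt carries only the stored patches plus the current query, which is where the token savings would come from in a design of this kind.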

Abstract

This is an updated version of the Sparse Semantic Patch Memory (SSPM) framework, including the full architecture, Gemini 2.5 Flash-based experimental validation, and an extended empirical evaluation demonstrating significant token reduction while preserving reasoning constraints.

Large Language Model (LLM) deployments face practical limitations due to restricted context windows, increasing inference costs, and the accumulation of redundant conversational history. Traditional approaches such as token truncation or surface-level compression often discard important constraints and decisions, leading to degraded reasoning continuity.

This repository presents Sparse Semantic Patch Memory (SSPM), a semantics-first conversational memory architecture that represents dialogue history as compact, utility-scored semantic patches rather than raw token sequences. Each conversational turn is decomposed into structured units such as entities, constraints, decisions, code snippets, equations, and structural cues, which are extracted using a DeepSeek-style semantic extraction pipeline implemented with Gemini 2.5 Flash. The SSPM framework applies schema-guided extraction, composite utility scoring, and dependency-aware greedy knapsack selection to retain only the most valuable patches under a strict token budget. Selected patches are stored in an indexed sparse memory structure and dynamically composed with the current query to form a compact prompt for downstream reasoning.

Empirical evaluation across five multi-turn technical dialogues demonstrates that SSPM achieves an average token reduction of 48.7% while preserving 100% of explicit constraints and decisions, significantly outperforming conventional raw history, truncation, and compression baselines. The system is implemented as a fully modular pipeline consisting of semantic extraction, utility scoring, sparse selection, indexed memory storage, and prompt composition. This design enables scalable, cost-aware conversational memory management and provides a practical foundation for building long-context reasoning systems and agentic LLM architectures.
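
To make the selection step in the abstract more concrete, here is a hedged sketch of one way a dependency-aware greedy knapsack could work. The abstract does not specify the scoring formula or the tie-breaking rules, so the following are assumptions: utilities are precomputed, patches are ranked by utility per token, a patch is admitted only together with all of its not-yet-selected dependencies, and every dependency id refers to a known patch. The names `Patch`, `closure`, and `select_patches` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    patch_id: str
    utility: float     # composite utility score (assumed precomputed)
    tokens: int        # token cost of the patch text
    deps: tuple = ()   # ids of patches this one depends on


def closure(pid, patches, selected):
    """Return pid plus all transitive dependencies that are not yet selected."""
    needed, stack = set(), [pid]
    while stack:
        cur = stack.pop()
        if cur in selected or cur in needed:
            continue
        needed.add(cur)
        stack.extend(patches[cur].deps)
    return needed


def select_patches(patch_list, budget):
    """Greedy knapsack by utility density; dependencies are pulled in as a group."""
    patches = {p.patch_id: p for p in patch_list}
    order = sorted(patch_list,
                   key=lambda p: p.utility / max(p.tokens, 1),
                   reverse=True)
    selected, used = set(), 0
    for p in order:
        if p.patch_id in selected:
            continue
        group = closure(p.patch_id, patches, selected)
        cost = sum(patches[g].tokens for g in group)
        if used + cost <= budget:   # admit the whole group or nothing
            selected |= group
            used += cost
    return selected, used


candidates = [
    Patch("c1", utility=0.9, tokens=10),
    Patch("d1", utility=0.8, tokens=8, deps=("c1",)),  # decision relying on c1
    Patch("e1", utility=0.2, tokens=20),
]
chosen, used_tokens = select_patches(candidates, budget=20)
print(sorted(chosen), used_tokens)  # ['c1', 'd1'] 18
```

Admitting a patch only with its dependency group keeps a decision from being stored without the constraint it rests on, at the cost of sometimes skipping a dense patch whose group does not fit the remaining budget. Greedy selection of this kind is a heuristic and is not guaranteed to be optimal.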

Cite This Study

Dhruv Dubey studied this question.

synapsesocial.com/papers/69b8f0fddeb47d591b8c5c9e https://doi.org/10.5281/zenodo.19033643
