Reverie achieves 94.6% on LongMemEval (n=500, GPT-4o judge), within 0.27 points of the top-performing system. A controlled Oracle experiment, which runs the same synthesis model (Claude Sonnet 4.6) with perfect retrieval, scores 93.4%, revealing that LongMemEval is model-dominated: the architecture contributes +1.2 points (94.6% vs. 93.4%), concentrated in the knowledge-update and multi-session categories, where architectural features (supersession tracking, session summaries) directly apply. This pattern is not unique to Reverie; we estimate comparable architectural deltas across leaderboard systems. The system is a two-layer memory architecture: L1 stores raw conversational experiences losslessly, and L2 extracts declarative facts, using LLM-confirmed supersession detection to handle knowledge updates. Both layers are searched with hybrid vector+keyword retrieval, and an LLM synthesizes the retrieved results into an answer. The paper's primary contribution is methodological: an iterative build-test-prune development process in which every component was subjected to ablation, and several (including four additional layers, weight decay, contextual embeddings, and LLM-declared edges) were removed when they degraded performance or failed to justify their complexity.
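The abstract does not specify Reverie's data structures, its supersession prompt, or how vector and keyword results are combined, so the following is only a minimal sketch of the two-layer idea under stated assumptions. All names (`Memory`, `Fact`, `add_fact`, `confirm_update`, the `key` slot on a fact) are hypothetical. The LLM supersession check is replaced by a callback, a character-trigram bag stands in for an embedding model, superseded facts are assumed to be kept but excluded from search, and the two ranked lists are merged with reciprocal rank fusion (RRF), a common fusion method that may differ from what Reverie actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model (assumption): character-trigram counts.
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(v * b[k] for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal rank fusion: each list adds 1 / (k + rank) for every document it ranks.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: -scores[d])


@dataclass
class Fact:
    key: str                             # hypothetical slot the fact describes, e.g. "user.city"
    text: str
    superseded_by: Optional[int] = None  # index of the newer fact that replaced this one


class Memory:
    def __init__(self, confirm_update: Callable[[str, str], bool]):
        self.experiences: list[str] = []      # L1: raw turns, stored verbatim
        self.facts: list[Fact] = []           # L2: extracted declarative facts
        self.confirm_update = confirm_update  # stand-in for the LLM supersession check

    def add_experience(self, turn: str) -> None:
        self.experiences.append(turn)

    def add_fact(self, key: str, text: str) -> None:
        new_idx = len(self.facts)
        for fact in self.facts:
            # Only live facts about the same slot are candidates; the callback confirms the update.
            if fact.key == key and fact.superseded_by is None and self.confirm_update(fact.text, text):
                fact.superseded_by = new_idx  # keep the old fact for history, but mark it stale
        self.facts.append(Fact(key, text))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Search both layers; superseded facts are excluded from retrieval.
        docs = self.experiences + [f.text for f in self.facts if f.superseded_by is None]
        q_terms, q_vec = set(tokens(query)), embed(query)
        ids = range(len(docs))
        by_keyword = sorted(ids, key=lambda i: -len(q_terms & set(tokens(docs[i]))))
        by_vector = sorted(ids, key=lambda i: -cosine(q_vec, embed(docs[i])))
        return [docs[i] for i in rrf([by_keyword, by_vector])[:top_k]]


mem = Memory(confirm_update=lambda old, new: True)  # demo: always confirm the update
mem.add_experience("User: I just moved to Berlin for work.")
mem.add_fact("user.city", "The user lives in Munich.")
mem.add_fact("user.city", "The user lives in Berlin.")
print(mem.search("Where does the user live?", top_k=1))  # → ['The user lives in Berlin.']
```

RRF is used here because it combines rankings by position only, so the cosine scores and keyword-overlap counts, which live on different scales, need no normalization before fusion.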
Waleed Abdullah
synapsesocial.com/papers/69b25b5496eeacc4fcec9f49; DOI: https://doi.org/10.5281/zenodo.18943822