Reverie achieves 94.6% on LongMemEval (n=500, GPT-4o judge), within 0.27 points of the top-performing system. A controlled Oracle experiment, running the same synthesis model (Claude Sonnet 4.6) with perfect retrieval, scores 93.4%, revealing that LongMemEval is model-dominated: the architecture contributes +1.2 points, concentrated in knowledge-update and multi-session categories where architectural features (supersession tracking, session summaries) directly apply. This pattern is not unique to Reverie; we estimate comparable architectural deltas across leaderboard systems. The system is a two-layer memory architecture: L1 stores raw conversational experiences losslessly, and L2 extracts declarative facts with LLM-confirmed supersession detection for knowledge updates. Both layers are searched with hybrid vector+keyword retrieval and synthesized by an LLM. The paper's primary contribution is methodological: an iterative build-test-prune development process in which every component was subjected to ablation, and several (including four additional layers, weight decay, contextual embeddings, and LLM-declared edges) were removed when they degraded performance or failed to justify their complexity.
Waleed Abdullah (Tue,) studied this question.
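The two-layer design summarized above can be illustrated with a minimal sketch. It is not Reverie's implementation: every name here (`TwoLayerMemory`, `Fact`, `hybrid_score`, the `confirm_supersession` callback, the `alpha` weight) is hypothetical. A bag-of-words cosine stands in for real vector embeddings, and a plain callable stands in for the LLM that confirms supersession. The sketch shows only the shape of the idea: L1 keeps raw turns unchanged, L2 keeps extracted facts and marks older ones as superseded, and both layers are ranked by a blended vector and keyword score.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Fact:
    """An L2 declarative fact (hypothetical schema, for illustration only)."""
    subject: str
    value: str
    session: int
    superseded_by: Optional["Fact"] = None


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Bag-of-words cosine: an assumed stand-in for embedding similarity.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_overlap(query: str, text: str) -> float:
    # Fraction of distinct query tokens that appear in the text.
    q = set(tokens(query))
    return len(q & set(tokens(text))) / len(q) if q else 0.0


def hybrid_score(query: str, text: str, alpha: float = 0.5) -> float:
    # alpha is an assumed blending weight; the source does not specify one.
    vec = cosine(Counter(tokens(query)), Counter(tokens(text)))
    return alpha * vec + (1 - alpha) * keyword_overlap(query, text)


class TwoLayerMemory:
    def __init__(self, confirm_supersession: Callable[[Fact, Fact], bool]):
        self.l1: List[Tuple[int, str]] = []  # raw turns, stored unchanged
        self.l2: List[Fact] = []             # extracted facts
        self.confirm = confirm_supersession  # stand-in for the LLM check

    def add_turn(self, session: int, text: str) -> None:
        self.l1.append((session, text))

    def add_fact(self, subject: str, value: str, session: int) -> Fact:
        new = Fact(subject, value, session)
        for old in self.l2:
            conflicts = (
                old.subject == subject
                and old.superseded_by is None
                and old.value != value
            )
            # Mark the old fact as superseded only if the checker agrees.
            if conflicts and self.confirm(old, new):
                old.superseded_by = new
        self.l2.append(new)
        return new

    def search(self, query: str, k: int = 3) -> List[Tuple[str, str]]:
        # Search both layers; superseded L2 facts are excluded from results.
        candidates = [("L1", text) for _, text in self.l1]
        candidates += [
            ("L2", f"{f.subject}: {f.value}")
            for f in self.l2
            if f.superseded_by is None
        ]
        ranked = sorted(
            candidates, key=lambda c: hybrid_score(query, c[1]), reverse=True
        )
        return ranked[:k]
```

In this sketch, adding a fact for a subject that already has a different active value marks the older fact as superseded (when the callback agrees), so a knowledge-update query retrieves only the current value while the raw L1 turns remain available. In a fuller system, the ranked candidates would then be passed to the synthesis model.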