This artifact release contains the benchmark, reference implementation, wrappers, scoring code, and result snapshots for “Reliable Agent Memory Is an Interface-Resolution Problem: A Benchmark for Temporal Belief Reconstruction.” The benchmark evaluates whether agent memory systems can reconstruct changing belief state with valid time, transaction time, evidence, and epistemic status. Its central diagnostic is cascading staleness: a system may retrieve an answer-correct claim while failing to flag that one of the claim’s transitive premises has been superseded. The benchmark also includes matched false-stale specificity cases, which test whether a system avoids marking a claim stale when a nearby update is outside the query’s valid-time scope.

The release includes:
- synthetic benchmark corpora covering corporate, fictional, regulatory, legal-amendment, and stress-test scenarios;
- a deterministic bitemporal oracle and structural scorer;
- reference DuckDB and SQLite implementations;
- no-LLM baselines, retrieval baselines, graph baselines, and long-context LLM wrappers;
- wrappers and/or result snapshots for evaluated agent-memory systems where applicable;
- premise-walk ablations over declared derivation links;
- provenance-quality perturbation and extraction-bound experiments;
- per-query TSV outputs, aggregate result tables, and reproducibility scripts.

The artifact is intended to support reproducible evaluation of agent-facing memory interfaces. In particular, it tests whether a memory system exposes enough temporal and provenance structure to traverse the dependency closure of a derived claim and detect stale premises under valid-time and transaction-time constraints. The artifact does not solve open-ended provenance extraction; it evaluates read-path behavior conditional on derivation links being represented or supplied. All expected answers and epistemic labels are derived from the formal oracle, not from system outputs or LLM judges.
Scoring is deterministic and structural.
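
The cascading-staleness check described above can be illustrated with a small premise walk. This is a minimal sketch under stated assumptions, not the artifact's oracle or schema: the `Claim` fields, the `supersedes` link, and the `stale_premises` function are hypothetical names chosen for illustration. It assumes a premise counts as stale when a superseding claim was recorded at or before the query's transaction time and its valid-time interval covers the query's valid time.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    # Hypothetical schema for illustration; not the artifact's actual format.
    claim_id: str
    valid_from: int                    # valid-time start (inclusive)
    valid_to: Optional[int]            # valid-time end (exclusive); None = open
    recorded_at: int                   # transaction time of recording
    premises: tuple[str, ...] = ()     # declared derivation links
    supersedes: Optional[str] = None   # id of the claim this one replaces


def covers(claim: Claim, valid_at: int) -> bool:
    """True if the claim's valid-time interval contains valid_at."""
    return claim.valid_from <= valid_at and (
        claim.valid_to is None or valid_at < claim.valid_to
    )


def stale_premises(claims, root_id, valid_at, as_of_tx):
    """Walk the transitive premises of root_id and return the stale ones."""
    by_id = {c.claim_id: c for c in claims}
    superseders: dict[str, list[Claim]] = {}
    for c in claims:
        if c.supersedes is not None:
            superseders.setdefault(c.supersedes, []).append(c)

    stale, seen = [], set()
    stack = list(by_id[root_id].premises)
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        # Assumed staleness rule: the update must be known at as_of_tx
        # and must apply at valid_at.
        if any(
            u.recorded_at <= as_of_tx and covers(u, valid_at)
            for u in superseders.get(pid, [])
        ):
            stale.append(pid)
        stack.extend(by_id[pid].premises)
    return sorted(stale)


claims = [
    Claim("p1", valid_from=0, valid_to=None, recorded_at=1),
    Claim("p2", valid_from=0, valid_to=None, recorded_at=2, premises=("p1",)),
    Claim("root", valid_from=0, valid_to=None, recorded_at=3, premises=("p2",)),
    # Update to p1 that only applies from valid time 10 onward.
    Claim("u1", valid_from=10, valid_to=None, recorded_at=5, supersedes="p1"),
]

cascading = stale_premises(claims, "root", valid_at=12, as_of_tx=6)
false_stale = stale_premises(claims, "root", valid_at=5, as_of_tx=6)
not_yet_known = stale_premises(claims, "root", valid_at=12, as_of_tx=4)
```

In this toy data, the first query flags `p1` as a stale transitive premise of `root`. The second mirrors a false-stale specificity case, because the update lies outside the query's valid-time scope. The third returns nothing because the update had not yet been recorded at the query's transaction time.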
Kai Hirota (Wed,) studied this question.