What question did this study set out to answer?

The study asks how agents in multi-agent reinforcement learning can maintain a persistent latent identity, specifically through bilateral perspective replay, offline sleep consolidation, and coherence governance.

March 22, 2026 · Open Access

Persistent Latent Identity in Multi-Agent Reinforcement Learning: Bilateral Replay, Sleep Consolidation, and Cross-Environment Generalization

Key Points

Developed a multi-agent reinforcement learning architecture with latent identity states.
Tested the architecture in three distinct cooperative and competitive grid worlds.
Analyzed the effects of bilateral replay, sleep processing, and coherence governance across environments.
Identity trait orderings emerged without supervision across all environments.
Bilateral replay and sleep processing showed a super-additive effect on performance.
Removing the value difference term of bilateral replay reduced reward (−10%), and removing the relational alignment term collapsed trait ordering (−7–8%); removing the KL divergence term had a negligible effect (−1%).

Abstract

We introduce a multi-agent reinforcement learning architecture in which agents maintain a persistent latent identity state that is updated through bilateral perspective replay and consolidated through offline sleep processing. We test the system across three cooperative and competitive grid worlds with different interaction structures and report five findings: (1) the identity system develops environmentally appropriate trait orderings without supervision across all three environments; (2) bilateral replay and sleep interact super-additively across all three environments; (3) a term-level decomposition of the bilateral replay signal isolates two distinct mechanisms: the value difference term drives reward improvement through optimization stabilization (−10% when removed), and the relational alignment term drives environment-specific trait differentiation, with ordering collapse when removed (−7–8%), while the KL divergence term is negligible (−1%), as confirmed across 4 seeds × 3 environments; (4) coherence governance in gate mode produces identical learning outcomes to monitor mode with zero persistent rejections; and (5) preliminary evidence (seed=42) shows that governance provides 2.4–2.6× character stability under adversarial transfer. Extended phases for recursive self-modification, environment co-evolution, cross-generational latent accumulation, and open-ended discovery are implemented and described as preliminary.
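
For readers unfamiliar with the term, "super-additive" in finding (2) can be read as a positive interaction contrast in a 2×2 ablation: the gain from enabling both components exceeds the sum of the gains from enabling each alone. The sketch below illustrates that check only. The function names, the choice of a scalar performance score, and the numbers are assumptions made for illustration; they are not taken from the paper, which may use a different statistic.

```python
def interaction_effect(baseline, replay_only, sleep_only, both):
    """Interaction contrast for a 2x2 ablation.

    Each argument is a scalar performance score (hypothetical here) for one
    condition: neither component, bilateral replay only, sleep only, or both.
    Returns the combined gain minus the sum of the individual gains.
    """
    gain_replay = replay_only - baseline
    gain_sleep = sleep_only - baseline
    gain_both = both - baseline
    return gain_both - (gain_replay + gain_sleep)


def is_super_additive(baseline, replay_only, sleep_only, both, tol=0.0):
    """True when the combined gain exceeds the sum of individual gains by more than tol."""
    return interaction_effect(baseline, replay_only, sleep_only, both) > tol


# Placeholder scores chosen for illustration; NOT results from the paper.
contrast = interaction_effect(baseline=1.0, replay_only=1.25, sleep_only=1.5, both=2.5)
print(contrast)  # 0.75: the combined gain (1.5) exceeds 0.25 + 0.5
```

In practice such a contrast would be computed per seed and per environment, with some measure of variability, before calling an interaction super-additive.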
