Model-based reinforcement learning agents that plan entirely in imagination can achieve high imagined returns while completely failing the actual task, a failure mode we term the exploitation gap. We provide the first systematic characterisation of this gap in DreamerV3 on AntMaze, where the world model receives near-zero reward from real experience. Instrumenting the training loop with four new metrics, we show that the imagined-to-real reward ratio reaches approximately 50x at 500k environment steps while evaluation return stays below 0.05. We establish that KL divergence collapse is a leading indicator of exploitation onset, preceding it by approximately 50k steps (r = -0.91, p < 0.001), which provides an actionable early-warning signal. Comparing to the hierarchical baseline THICK, we show that sparse context-kernel gating reduces but does not eliminate the gap. A dense-reward ablation confirms that a rich reward signal suppresses exploitation entirely. We propose three KL-aware mitigation strategies and release all experimental infrastructure for reproducibility.
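As a rough illustration of the two quantities the abstract relies on, the sketch below computes an imagined-to-real reward ratio and a lagged Pearson correlation between a leading series (for example, logged KL divergence) and a trailing series (for example, the reward ratio). It is not the paper's instrumentation: the function names, the `eps` guard, and the assumption that both series are logged at the same fixed interval (so a lag is counted in logging steps rather than environment steps) are all hypothetical.

```python
import math


def reward_ratio(imagined, real, eps=1e-8):
    """Per-step imagined-to-real reward ratio.

    eps is an assumed guard against division by near-zero real reward.
    """
    return [i / max(abs(r), eps) for i, r in zip(imagined, real)]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_pearson(leading, trailing, lag):
    """Correlate leading[t] with trailing[t + lag], lag in logging steps."""
    if len(leading) != len(trailing):
        raise ValueError("series must have the same length")
    if not 0 <= lag < len(leading) - 1:
        raise ValueError("lag must leave at least two overlapping points")
    return pearson(leading[: len(leading) - lag], trailing[lag:])


def best_lag(leading, trailing, max_lag):
    """Return the (lag, r) pair with the largest |r| for lags 0..max_lag."""
    scores = [(lag, lagged_pearson(leading, trailing, lag))
              for lag in range(max_lag + 1)]
    return max(scores, key=lambda item: abs(item[1]))
```

A strongly negative correlation at a positive lag would be consistent with the leading series dropping before the trailing series rises. Scanning several lags and reporting only the best one inflates the apparent significance, so any p-value would need to account for that selection.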
Arkat Khassanov
Astana Medical University
Arkat Khassanov (Thu,) studied this question.
www.synapsesocial.com/papers/69f44325967e944ac55667ca — DOI: https://doi.org/10.5281/zenodo.19894702