Standard reinforcement learning (RL) typically assumes that (1) the environment resets fully after each episode, and (2) learning relies on reward signals. These assumptions make it difficult to handle tasks where a single episode is too short to reach the goal, and where necessary intermediate actions produce zero reward and terminate the episode.

In this paper, we describe a simple extension: we split the environment state into an agent-specific transient part and a global persistent part that is inherited across episodes. We also propose an event-triggered experience buffer that stores zero-reward but goal-critical trajectories and reuses them for behavior cloning in later episodes.

We test this idea in the Sealed Corridor environment, a small grid world where the goal can only be reached after several generations of sacrificial actions. Experimental results across 30 random seeds show that standard PPO often fails (60% failure rate) and requires many episodes (mean > 2300) when it succeeds. DQN and a pure evolutionary baseline can occasionally succeed but are inefficient (mean > 170 and > 1000 episodes, respectively). In contrast, our method (with the persistent state and experience buffer) achieves a mean of 70.2 episodes to first success and succeeds in all 30 seeds (p < 1e-10 compared to baselines).

These results, obtained in a very simple environment, suggest that adding inter-episode state persistence and a mechanism to reuse zero-reward but critical trajectories may be beneficial for tasks that require multi-step coordination across episodes. The work is a proof-of-concept; the absolute performance and the complexity of the environment are both low.
Guoyong Chen (Thu,) studied this question.
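The two mechanisms named in the abstract, a state split into transient and persistent parts and an event-triggered buffer for zero-reward trajectories, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy corridor is not the Sealed Corridor environment, and the names `PersistentState`, `TransientState`, `EventTriggeredBuffer`, `run_episode`, and `scripted_policy`, the door-opening rule, and the choice of trigger event (a change in the persistent state) are all hypothetical. The behavior-cloning step itself is not implemented; the buffer only exposes the (observation, action) pairs that such a step might consume.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

MOVE, OPEN = 0, 1             # hypothetical action ids
Transition = Tuple[int, int]  # (observation, action)


@dataclass
class PersistentState:
    """Global part of the state, inherited across episodes."""
    opened_doors: Set[int] = field(default_factory=set)


@dataclass
class TransientState:
    """Agent-specific part, rebuilt at every episode reset."""
    position: int = 0
    steps_left: int = 5


class EventTriggeredBuffer:
    """Stores a trajectory when a trigger event fires, regardless of reward."""

    def __init__(self, capacity: int = 100) -> None:
        self.capacity = capacity
        self.trajectories: List[List[Transition]] = []

    def maybe_store(self, trajectory: List[Transition], event_fired: bool) -> bool:
        if not event_fired:
            return False
        if len(self.trajectories) >= self.capacity:
            self.trajectories.pop(0)  # drop the oldest trajectory
        self.trajectories.append(list(trajectory))
        return True

    def cloning_pairs(self) -> List[Transition]:
        """Flatten stored trajectories into (observation, action) pairs."""
        return [pair for traj in self.trajectories for pair in traj]


def scripted_policy(obs: int, persistent: PersistentState) -> int:
    """Hypothetical stand-in policy: move if the door ahead is open, else open it."""
    return MOVE if obs in persistent.opened_doors else OPEN


def run_episode(
    persistent: PersistentState,
    policy: Callable[[int, PersistentState], int],
    buffer: EventTriggeredBuffer,
    corridor_length: int = 3,
) -> float:
    """One episode of a toy corridor; door i blocks the step from cell i to i + 1."""
    agent = TransientState()  # only the transient part is reset
    trajectory: List[Transition] = []
    persistent_changed = False
    reward = 0.0
    while agent.steps_left > 0:
        obs = agent.position
        action = policy(obs, persistent)
        trajectory.append((obs, action))
        agent.steps_left -= 1
        if action == OPEN:
            # Assumed sacrificial action: zero reward, ends the episode,
            # but leaves a lasting change in the persistent state.
            if obs not in persistent.opened_doors:
                persistent.opened_doors.add(obs)
                persistent_changed = True
            break
        if obs in persistent.opened_doors:  # MOVE succeeds only through an open door
            agent.position += 1
            if agent.position == corridor_length:
                reward = 1.0
                break
    # Assumed trigger event: the persistent state changed during this episode.
    buffer.maybe_store(trajectory, event_fired=persistent_changed)
    return reward


def demo(episodes: int = 4) -> Tuple[List[float], EventTriggeredBuffer]:
    persistent = PersistentState()
    buffer = EventTriggeredBuffer()
    rewards = [run_episode(persistent, scripted_policy, buffer) for _ in range(episodes)]
    return rewards, buffer


if __name__ == "__main__":
    rewards, buffer = demo()
    print(rewards, len(buffer.trajectories))
```

With the scripted policy and a corridor of length 3, the first three episodes each end in a zero-reward sacrificial action that opens one more door, and the fourth episode reaches the goal. A purely reward-driven replay scheme would see nothing worth keeping in the first three episodes, whereas the event trigger keeps all three trajectories.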