March 3, 2026 · Open Access

TNCOA: Efficient Exploration via Observation‐Action Constraint on Trajectory‐Based Intrinsic Reward

Key Points

The method achieves state‐of‐the‐art convergence speed and average returns in the MiniGrid‐ObstructedMaze environment.
It also generalises well to high‐dimensional Atari benchmarks.
The approach uses a trajectory‐level novelty measure and the mutual information between actions and trajectory novelty.
The work targets environments with sparse rewards and complex object interactions, where effective exploration strategies are needed.

Abstract

Efficient exploration is critical in handling sparse rewards and partial observability in deep reinforcement learning. However, most existing intrinsic reward methods based on novelty rely on single‐step observations or Euclidean distances. These approaches struggle to capture trajectory‐level novelty and often perform poorly in partially observable settings. Moreover, they typically ignore the role of actions in driving observation changes, as not all actions lead to meaningful state transitions. To overcome these limitations, we propose a trajectory‐level novelty measure that estimates the novelty of a state by comparing current observations with past ones along the trajectory. To focus on meaningful exploration, we incorporate the mutual information between actions and trajectory novelty to filter out random fluctuations and retain only novelty caused by the agent's actions. Additionally, we introduce a first‐visit constraint on observation–action pairs, rewarding only interactions that result in state transitions to enhance exploration efficiency. We conducted experiments in the MiniGrid‐ObstructedMaze environment, which is characterised by complex object interactions and sparse rewards. Results demonstrate that our method achieves state‐of‐the‐art performance in convergence speed and average returns. Furthermore, it shows strong generalisation on high‐dimensional Atari benchmarks and demonstrates robust performance in more challenging MiniGrid variants. Implementation code is available at: https://github.com/MurrayMa0816/TNCOA.
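The first‐visit constraint on observation–action pairs can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that); it only shows one plausible reading of the idea. The class name `FirstVisitBonus`, the `bonus` parameter, the per‐episode `reset`, and the use of exact array equality to detect a state transition are all assumptions made for illustration. The trajectory‐level novelty and mutual‐information terms are not modelled here.

```python
import hashlib

import numpy as np


class FirstVisitBonus:
    """Hypothetical helper: pays a bonus at most once per (observation, action)
    pair, and only when the action actually changed the observation."""

    def __init__(self, bonus=1.0):
        self.bonus = bonus   # assumed constant bonus; the paper's reward differs
        self.seen = set()    # (observation hash, action) pairs already rewarded

    def reset(self):
        # Assumption: the first-visit memory is cleared at each episode start.
        self.seen.clear()

    @staticmethod
    def _key(obs, action):
        arr = np.ascontiguousarray(obs)
        digest = hashlib.sha1(arr.tobytes()).hexdigest()
        return (digest, arr.shape, str(arr.dtype), int(action))

    def __call__(self, obs, action, next_obs):
        # No reward if the action did not change the observation.
        if np.array_equal(obs, next_obs):
            return 0.0
        key = self._key(obs, action)
        # No reward if this pair has already been rewarded.
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return self.bonus
```

In a training loop, such a gate would be called once per environment step with the observation before the action, the action, and the observation after it, and its output would scale or replace the intrinsic reward for that step.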
