Visual reinforcement learning has proven effective at solving control tasks with high-dimensional observations. However, extracting reliable and generalizable representations from vision-based observations remains a central challenge. Inspired by the human thought process, we posit that a visual representation which can both predict the future and trace back history is a reliable and accurate summary of the environmental state. Based on this idea, we introduce a Bidirectional Transition (BT) framework for representation learning. The framework uses bidirectional prediction of forward and backward environmental transitions as auxiliary tasks to extract reliable representations. We additionally introduce an inverse dynamics model that predicts the actions causing environmental state transitions, so that the state representations capture task-relevant information. Our method demonstrates competitive generalization performance and sample efficiency in two settings of the DeepMind Control suite. Moreover, we use a robotic manipulation simulator, the autonomous driving simulator CARLA, and the visual navigation simulator Habitat to demonstrate the wide applicability of our method. The results indicate that BT yields more stable and reliable representations and exhibits robust generalization performance on visual reinforcement learning tasks.
Hu et al. studied this question.
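
The three auxiliary tasks named in the abstract (forward transition prediction, backward transition prediction, and inverse dynamics) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the linear models, the `tanh` encoder, the mean-squared-error losses, the equal weighting of the three terms, the choice to condition the backward model on the action, and all names and dimensions (`bt_auxiliary_losses`, `W_FWD`, `OBS_DIM`, and so on) are hypothetical. The abstract does not specify architectures or loss functions.

```python
import math
import random

random.seed(0)
OBS_DIM, ACT_DIM, LATENT_DIM = 6, 2, 4  # hypothetical sizes, not from the paper


def rand_matrix(rows, cols):
    """Small random weight matrix (rows x cols), standing in for a learned network."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weights):
    """Multiply a length-`rows` vector by a rows x cols matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Hypothetical linear stand-ins for the encoder and the three auxiliary heads.
W_ENC = rand_matrix(OBS_DIM, LATENT_DIM)
W_FWD = rand_matrix(LATENT_DIM + ACT_DIM, LATENT_DIM)  # (z_t, a_t) -> z_{t+1}
W_BWD = rand_matrix(LATENT_DIM + ACT_DIM, LATENT_DIM)  # (z_{t+1}, a_t) -> z_t
W_INV = rand_matrix(2 * LATENT_DIM, ACT_DIM)           # (z_t, z_{t+1}) -> a_t


def encode(obs):
    """Map an observation vector to a latent representation."""
    return [math.tanh(x) for x in linear(obs, W_ENC)]


def bt_auxiliary_losses(obs, action, next_obs):
    """Compute the three auxiliary losses for one transition (obs, action, next_obs)."""
    z, z_next = encode(obs), encode(next_obs)
    return {
        # Forward: predict the next latent from the current latent and action.
        "forward": mse(linear(z + action, W_FWD), z_next),
        # Backward: recover the current latent from the next latent and action.
        "backward": mse(linear(z_next + action, W_BWD), z),
        # Inverse dynamics: predict the action from the two latents.
        "inverse": mse(linear(z + z_next, W_INV), action),
    }


# One synthetic transition; a real agent would sample these from a replay buffer.
obs = [random.gauss(0.0, 1.0) for _ in range(OBS_DIM)]
action = [random.uniform(-1.0, 1.0) for _ in range(ACT_DIM)]
next_obs = [random.gauss(0.0, 1.0) for _ in range(OBS_DIM)]

losses = bt_auxiliary_losses(obs, action, next_obs)
total_aux_loss = sum(losses.values())  # equal weighting is an assumption
print(losses, total_aux_loss)
```

In a full agent, a sum like `total_aux_loss` would presumably be minimized alongside the reinforcement learning objective so that gradients from all three tasks shape the shared encoder. The sketch only evaluates the losses and performs no training.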