As autonomous driving systems are deployed more widely, the opaque nature of deep reinforcement learning (DRL) decision models makes driving decisions hard to understand and validate. To address this challenge, we propose a Hybrid Attribution-based Interpretable Deep Reinforcement Learning framework (HA-IDRL) for autonomous driving behavior decision-making. The framework introduces a Hybrid Gradient–LRP (HGL) attribution mechanism that integrates gradient-based attribution with Layer-wise Relevance Propagation (LRP). Gradients capture how sensitive a decision is to each input, whereas LRP captures how much each input contributes to it; combining the two yields more consistent and comprehensive post hoc explanations. Beyond post hoc interpretability, we improve structural interpretability by replacing the conventional multilayer perceptron (MLP) in the Dueling Deep Q-Network (Dueling DQN) architecture with a Kolmogorov–Arnold Network (KAN). Because a KAN represents nonlinear interactions through learnable univariate functions and explicit summation structures, it provides inherently interpretable functional decompositions. The proposed framework is evaluated on a highway lane-changing task in the highway-env simulator. Experimental results show that HA-IDRL achieves decision-making performance comparable to representative DRL baselines, including Dueling DQN and Soft Actor-Critic (SAC), while providing explanations that are more stable and better aligned with human driving semantics. Moreover, the explanations are produced with low computational overhead, which makes real-time interpretability feasible in practical autonomous driving applications. Overall, HA-IDRL advances trustworthy autonomous driving by combining high-performance decision-making with rigorous, multi-level interpretability, thereby improving the transparency and operational reliability of DRL-based driving policies.
Liu et al. (Mon,) studied this question.
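To make the idea of combining sensitivity and contribution information concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy bias-free ReLU network with made-up weights, the LRP epsilon rule, and a hypothetical fusion rule (a weighted sum of L1-normalized absolute scores controlled by `alpha`); the abstract does not state how HGL actually fuses the two signals.

```python
# Toy Q-network: 3 input features -> 2 hidden ReLU units -> 1 Q-value.
# All weights are hypothetical and chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 2.0], [-2.0, 1.0]]  # shape (inputs, hidden)
W2 = [1.0, -0.5]                             # shape (hidden,)


def forward(x):
    """Return hidden pre-activations, hidden activations and the scalar output."""
    z = [sum(x[i] * W1[i][j] for i in range(len(x))) for j in range(len(W2))]
    h = [max(0.0, v) for v in z]
    q = sum(h[j] * W2[j] for j in range(len(W2)))
    return z, h, q


def stabilize(v, eps):
    """Push a denominator away from zero, as in the LRP epsilon rule."""
    return v + eps if v >= 0 else v - eps


def gradient_attribution(x):
    """Sensitivity: dq/dx_i, derived by hand for this ReLU network."""
    z, _, _ = forward(x)
    gate = [1.0 if v > 0 else 0.0 for v in z]
    return [
        sum(W1[i][j] * gate[j] * W2[j] for j in range(len(W2)))
        for i in range(len(x))
    ]


def lrp_attribution(x, eps=1e-9):
    """Contribution: redistribute the output q back to the inputs layer by layer."""
    z, h, q = forward(x)
    # Relevance of each hidden unit is its share of the output.
    r_hidden = [h[j] * W2[j] / stabilize(q, eps) * q for j in range(len(W2))]
    # Relevance of each input is its share of every hidden pre-activation.
    return [
        sum(x[i] * W1[i][j] / stabilize(z[j], eps) * r_hidden[j]
            for j in range(len(W2)))
        for i in range(len(x))
    ]


def l1_normalize(values):
    """Scale absolute scores so they sum to one (all zeros if the input is all zeros)."""
    total = sum(abs(v) for v in values)
    return [abs(v) / total if total > 0 else 0.0 for v in values]


def hybrid_attribution(x, alpha=0.5):
    """Assumed fusion rule: weighted sum of normalized gradient and LRP scores."""
    g = l1_normalize(gradient_attribution(x))
    r = l1_normalize(lrp_attribution(x))
    return [alpha * gi + (1 - alpha) * ri for gi, ri in zip(g, r)]


x = [1.0, 2.0, 0.5]  # hypothetical observation features
print("gradient:", gradient_attribution(x))
print("lrp:     ", lrp_attribution(x))
print("hybrid:  ", hybrid_attribution(x))
```

In this toy case the two methods rank the features differently: the gradient is largest in magnitude for the third feature, while LRP assigns the most relevance to the first. That is the kind of complementary information a hybrid score is meant to reconcile. For a bias-free ReLU network with a vanishing epsilon, LRP reduces to gradient times input, so the LRP scores sum to the network output.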