To investigate how driver attention changes dynamically in complex road traffic scenarios, this paper proposes DAFNet, a driver attention prediction method based on cross-modal adaptive feature fusion. First, semantic segmentation is applied to the input image sequences, and a dual-branch encoder built on a 3D residual network extracts spatio-temporal features from the RGB images and the semantic information in parallel. Next, a 3D deformable attention mechanism is introduced to enhance the conventional Transformer; it predicts spatio-temporal offsets to focus on key salient regions and adaptively fuses the cross-modal features. Subsequently, a predictive recurrent neural network forecasts the fused spatio-temporal features, improving the stability of long-term sequence prediction. Finally, a lightweight decoder produces the driver attention prediction. Experimental results show that the proposed method outperforms the compared methods in overall performance. Its predictions not only capture salient regions in driving scenes in a bottom-up manner but also track the driver’s intent in a top-down manner, so the method adapts well to a variety of complex traffic scenarios. The method also runs inference at 53.73 frames per second, which meets the real-time requirement of on-vehicle systems.
Zhang et al. (Mon,) studied this question.