What does this research mean for the field?

The proposed driver attention prediction method (DAFNet) outperforms the methods it was compared against in predicting driver attention in complex road traffic scenarios. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to enhance driver attention prediction using an adaptive feature fusion method in complex traffic situations.

February 25, 2026 | Open Access

Driver Attention Prediction Based on Adaptive Fusion of Cross-Modal Features

Key Points

Implemented semantic segmentation on input image sequences.
Developed a dual-branch encoder with a 3D residual network for spatio-temporal feature extraction.
Introduced a 3D deformable attention mechanism to refine the Transformer algorithm.
Employed a predictive recurrent neural network for long-term sequence forecasting.
Used a lightweight decoder to produce driver attention predictions.
The method outperforms the compared methods in overall performance.
Predictions effectively capture key areas in driving scenes and track driver intent.
Achieved inference speed of 53.73 frames per second, meeting real-time requirements.
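
To put the reported speed in context, the short sketch below converts 53.73 frames per second into an average time per frame and compares it with a frame budget. The 30 fps camera rate is an assumption made for illustration; the source does not state the frame rate that defines its real-time requirement.

```python
def frame_time_ms(fps: float) -> float:
    """Average time per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


reported_fps = 53.73  # inference speed reported in the source
camera_fps = 30.0     # ASSUMPTION: camera rate, not stated in the source

per_frame = frame_time_ms(reported_fps)  # time the model needs per frame
budget = frame_time_ms(camera_fps)       # time available per camera frame

print(f"{per_frame:.2f} ms per frame against a {budget:.2f} ms budget")
```

Under that assumed camera rate, the model needs about 18.61 ms per frame, which leaves headroom within the roughly 33.33 ms available between frames.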

Abstract

To investigate the dynamic changes in driver attention in complex road traffic scenarios, this paper proposes a driver attention prediction method based on cross-modal adaptive feature fusion (DAFNet). First, semantic segmentation is applied to the input image sequences, and a dual-branch encoder using a 3D residual network is designed to extract spatio-temporal features from both RGB images and semantic information in parallel. Next, a 3D deformable attention mechanism is introduced to enhance the traditional Transformer algorithm; the mechanism focuses on the key salient regions through spatio-temporal offset prediction and adaptive fusion of cross-modal features. Subsequently, a predictive recurrent neural network is employed to forecast the fused spatio-temporal features and improve the stability of long-term sequence prediction. Finally, a lightweight decoder predicts the driver attention results. Experimental results demonstrate that the proposed method outperforms the compared methods in overall performance. The predictions not only capture salient regions in driving scenes in a bottom-up manner but also track the driver’s intent in a top-down manner. Thus, the method exhibits strong adaptability to various complex traffic scenarios. Additionally, the method achieves an inference speed of 53.73 frames per second, satisfying the real-time performance requirement of on-vehicle systems.
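
The idea of adaptively fusing RGB and semantic features can be illustrated with a much simpler stand-in than the paper's 3D deformable attention. The sketch below is not DAFNet's mechanism: it is a plain per-element gated blend, and the function name `gated_fusion` and the fixed weights `w_rgb`, `w_sem`, and `bias` are hypothetical (in a trained model such weights would be learned).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, sem, w_rgb=1.0, w_sem=1.0, bias=0.0):
    """Blend two equal-length feature vectors with a per-element gate.

    ASSUMPTION: a simple gated blend used only for illustration; it is
    NOT the 3D deformable attention described in the abstract.
    """
    if len(rgb) != len(sem):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for r, s in zip(rgb, sem):
        # The gate depends on both inputs, so the mix adapts per element.
        g = sigmoid(w_rgb * r + w_sem * s + bias)
        # Convex combination: g weights the RGB feature, 1 - g the semantic one.
        fused.append(g * r + (1.0 - g) * s)
    return fused


rgb_features = [1.0, 0.0, 2.0]  # toy values standing in for RGB-branch features
sem_features = [0.0, 0.0, 2.0]  # toy values standing in for semantic-branch features
fused_features = gated_fusion(rgb_features, sem_features)
print(fused_features)
```

Because each output is a convex combination of the two inputs, every fused value lies between the corresponding RGB and semantic values; the gate only decides how far it leans toward one modality.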
