Efficient vigilance estimation in driving scenarios requires a balance between model performance and practicality. Electroencephalography (EEG) directly reflects brain activity and is widely used for vigilance estimation, but its acquisition process is complicated and difficult to apply in real-world driving. In contrast, physiological signals such as electrooculogram, electrodermal activity, and photoplethysmography are easier to deploy in practice, but the information they provide is relatively limited. To address these issues, we propose a delay-aware cross-modal knowledge distillation method in which EEG signals are used only to train the teacher model. An information-theoretic criterion based on mutual information and response delay is then employed to determine which physiological signals are suitable as the student modality for knowledge distillation from the EEG-based teacher model. Because physiological signals differ in their sensitivity to cognitive responses, they are inherently misaligned in time with EEG and with one another. We therefore propose a delay-aware soft alignment mechanism (DASA), which introduces learnable delay and spread parameters at the patch level to handle this temporal misalignment and to capture the asynchronous dynamics between EEG and the other physiological signals, yielding soft, temporally aligned supervision from the teacher to the student model. Finally, an objective function incorporating cross-modal consistency, patch-level alignment, and smoothness regularization is designed to support effective training of the proposed method. Extensive experiments on the MMV and SEED-VIG datasets validate that the proposed method outperforms existing methods in terms of estimation accuracy and temporal alignment while maintaining the real-time performance required for edge deployment.
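To make the idea of patch-level soft alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each student patch attends to teacher patches through a normalized Gaussian kernel centered `delay` patches earlier, with width `spread`. The function names, the Gaussian form, the sign convention of the delay, and the use of fixed scalars in place of learnable parameters are all assumptions made for illustration.

```python
import math


def soft_alignment_weights(num_patches, delay, spread):
    """Row i holds normalized weights over teacher patches for student patch i.

    Assumption: the student signal lags the teacher by `delay` patches, so
    student patch i is centered on teacher patch i - delay. `spread` is the
    Gaussian width in patches. Both are plain floats here; in a trainable
    model they would be learnable parameters.
    """
    weights = []
    for i in range(num_patches):
        center = i - delay
        logits = [-((j - center) ** 2) / (2.0 * spread ** 2)
                  for j in range(num_patches)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def align_teacher(teacher_feats, weights):
    """Soft-aligned teacher target per student patch (weighted average)."""
    dim = len(teacher_feats[0])
    return [
        [sum(w * feat[d] for w, feat in zip(row, teacher_feats))
         for d in range(dim)]
        for row in weights
    ]


def alignment_loss(student_feats, teacher_feats, delay, spread):
    """Mean squared error between student patches and aligned teacher targets."""
    weights = soft_alignment_weights(len(student_feats), delay, spread)
    targets = align_teacher(teacher_feats, weights)
    sq = [(s - t) ** 2
          for s_row, t_row in zip(student_feats, targets)
          for s, t in zip(s_row, t_row)]
    return sum(sq) / len(sq)


# Toy data: the student repeats the teacher sequence two patches later.
teacher = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
student = [[0.0], [0.0], [0.0], [1.0], [2.0], [3.0]]

loss_matched = alignment_loss(student, teacher, delay=2.0, spread=0.1)
loss_unshifted = alignment_loss(student, teacher, delay=0.0, spread=0.1)
```

In this toy setting the loss is close to zero when the assumed delay matches the true two-patch lag and clearly larger when no delay is assumed. A larger `spread` would blend neighboring teacher patches into each target, which is the sense in which the supervision is soft rather than a hard shift.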
Sun et al. (Thu,) studied this question.