What does this research mean for the field?

A hybrid Vision Transformer Convolutional Neural Network (ViT-CNN) architecture enables highly accurate, real-time, and non-intrusive driver drowsiness detection suitable for embedded deployment. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

To develop a real-time drowsiness detection framework using a deep learning approach that enhances road safety.

March 26, 2026 | Open Access

Enhancing road safety through deep learning-based drowsiness detection using vision AI

Key Points

Utilized a hybrid Vision Transformer Convolutional Neural Network (ViT-CNN) architecture.
Employed multi-task learning and temporal attention mechanisms.
Performed self-supervised pretraining on over 1.2 million unlabeled driving videos.
Optimized for embedded deployment with INT8 quantization and TensorRT.
Applied explainability tools like Grad-CAM++ and Bayesian uncertainty estimation.
Achieved 99.27% accuracy in drowsiness detection.
Obtained an F1 score of 0.98 and AUC of 0.998.
Sustained 42 frames per second at 42 ms latency on NVIDIA Jetson AGX Xavier.
Demonstrated strong generalization across six diverse datasets.
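
The INT8 quantization step listed above can be illustrated with a minimal sketch of symmetric per-tensor quantization. This is an assumption-laden illustration of the general idea only: the paper's actual calibration procedure is not described on this page, and TensorRT chooses its scales from calibration data rather than from a simple maximum. The function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Assumption: the scale comes from the largest absolute value. Production
    toolchains usually derive it from calibration statistics instead.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the rounding visible.
weights = [0.50, -1.27, 0.003, 1.00]
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)

# The round-trip error of each value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Small values such as 0.003 collapse to zero here, which is the kind of precision loss that makes accuracy worth re-measuring after quantization.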

Abstract

Driver drowsiness is a major cause of road accidents worldwide, leading to thousands of fatalities and billions in economic losses each year. This study presents a real-time drowsiness detection framework based on a hybrid Vision Transformer Convolutional Neural Network (ViT-CNN) architecture enhanced with multi-task learning and temporal attention. Unlike traditional sensor-based or reactive vehicle behavior methods, the proposed vision-based approach provides a non-intrusive, scalable solution capable of detecting early fatigue indicators such as eye closure, yawning, and head pose. The model leverages self-supervised pretraining on over 1.2 million unlabeled driving videos and is optimized for embedded deployment using INT8 quantization and TensorRT, achieving 99.27% accuracy, F1 = 0.98, and AUC = 0.998 while sustaining 42 FPS at 42 ms latency on the NVIDIA Jetson AGX Xavier. Explainability tools (Grad-CAM++ and Bayesian uncertainty estimation) ensure transparency in safety-critical contexts. Evaluation across six datasets demonstrates strong generalization, and the framework is adaptable to other fatigue-sensitive domains such as aviation and industrial safety.
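
To give a feel for what temporal attention does with a sequence of frames, the sketch below pools per-frame feature vectors using softmax weights from a scaled dot product with a query vector. It is a generic illustration, not the paper's mechanism, which this page does not specify. The feature layout (eye closure, yawning, head tilt), the fixed `query` vector, and the function names are all hypothetical. In a trained model the query and the features would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, query):
    """Weight each frame by softmax(frame . query / sqrt(d)) and sum the frames."""
    d = len(query)
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(d)
        for frame in frame_features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(d)
    ]
    return pooled, weights


# Hypothetical per-frame features: [eye_closure, yawn, head_tilt].
frames = [
    [0.1, 0.0, 0.1],  # alert frame
    [0.9, 0.2, 0.3],  # eyes mostly closed
    [0.8, 0.7, 0.4],  # eyes closed and yawning
]
query = [1.0, 1.0, 0.5]  # assumed fixed here; learned in a real model

pooled, weights = temporal_attention_pool(frames, query)
print(weights)
print(pooled)
```

Frames with stronger fatigue cues receive larger weights, so the pooled vector is dominated by the most informative moments in the clip rather than by a plain average.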
