The rapid advancement of deepfake generation techniques has exposed critical limitations in existing deepfake detection methods, particularly their inability to achieve high detection accuracy and real-time efficiency simultaneously across diverse datasets. To address this gap, this study proposes the Efficient-Swin Attention Network (ESANet), a hybrid deep learning framework for real-time deepfake detection that jointly exploits local and global facial features. ESANet integrates EfficientNet-B0 for lightweight local feature extraction with the Swin Transformer to model hierarchical global facial relationships, and combines the two representations via an efficient feature fusion mechanism. The proposed framework is evaluated on three benchmark datasets: FaceForensics++, CelebV1, and CelebV2. On these datasets it achieves detection accuracies of 96.5%, 95.3%, and 94.8%, respectively, while maintaining low inference latency suitable for real-time applications. Cross-dataset evaluations further confirm the robustness and generalisation capability of the proposed approach. By enabling accurate and efficient deepfake detection, this work helps strengthen trust and mitigate.
Javed et al. studied this question.
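The abstract describes combining a local representation (from EfficientNet-B0) with a global one (from the Swin Transformer) before classification, but does not specify the fusion mechanism. The following is a minimal sketch of one common option, concatenation followed by a logistic output, and is not necessarily the mechanism used in ESANet. The function names `fuse_concat` and `fake_probability`, the toy vectors, and the zero weights are all hypothetical and chosen only for illustration; in practice the two vectors would come from the pretrained backbones and the weights would be learned.

```python
import math
from typing import List, Sequence


def fuse_concat(local_feat: Sequence[float], global_feat: Sequence[float]) -> List[float]:
    """Fuse two feature vectors by concatenation.

    Assumption: the source does not state how the representations are
    combined; concatenation is used here as a simple stand-in.
    """
    return list(local_feat) + list(global_feat)


def fake_probability(fused: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map a fused feature vector to a probability with a logistic unit.

    Hypothetical classifier head: one linear layer followed by a sigmoid.
    """
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    logit = sum(f * w for f, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy vectors standing in for the backbone outputs (hypothetical values).
local_feat = [0.2, -0.1, 0.4]   # stand-in for EfficientNet-B0 local features
global_feat = [0.5, 0.3]        # stand-in for Swin Transformer global features

fused = fuse_concat(local_feat, global_feat)

# Untrained (all-zero) weights, so the head is maximally uncertain.
weights = [0.0] * len(fused)
p_fake = fake_probability(fused, weights, bias=0.0)
```

Concatenation keeps both representations intact and leaves the weighting of local against global evidence to the classifier head; attention-based or gated fusion would be alternatives under the same interface.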