Face-swap deepfakes have risen in fidelity and accessibility, posing growing threats to personal privacy, identity integrity, and public trust in digital media. Modern generative models produce manipulated content that escapes casual human observation and can deceive conventional automated detectors. This growing realism calls for detection systems that are robust, transparent, and computationally efficient. We propose a lightweight, multi-modal AI framework that fuses spatial (CNN), temporal (LSTM/GRU), frequency (DCT/FFT), and audio features through an attention-based fusion mechanism to identify face-swap deepfake videos. The framework is designed for real-world deployability as well as detection accuracy: it uses key-frame extraction and compact neural backbones to operate on constrained hardware. Explainability is prioritized through visualization tools such as Grad-CAM and integrated gradients, together with modality-level confidence reporting, to improve forensic interpretability. Our work bridges the gap between high-performance academic models and practical field applications by focusing on modular design, reproducible experimentation, and cross-dataset generalization. The resulting system aims to support real-time media verification pipelines, assist investigators in forensic reporting, and promote public resilience against synthetic media threats. Overall, the framework lays a foundation for transparent, efficient, and responsible deepfake detection in the evolving landscape of generative AI.
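The abstract does not specify how the attention-based fusion is computed, so the following is only a minimal sketch of one common formulation: scaled dot-product scoring of each modality embedding against a single query vector, followed by a softmax-weighted sum. The function names, the `query` vector, the four-dimensional toy embeddings, and the choice of scaled dot-product scoring are all assumptions made for illustration, not the authors' implementation. In a trained model the embeddings would come from the CNN, LSTM/GRU, DCT/FFT, and audio branches, and the query would be learned.

```python
import math

# Modality names taken from the abstract; everything else is hypothetical.
MODALITIES = ("spatial", "temporal", "frequency", "audio")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings into one vector.

    embeddings: dict mapping modality name -> list of floats, all of length d
    query: list of floats of length d (stands in for a learned parameter)
    Returns (fused_vector, weights_by_modality).
    """
    names = [m for m in MODALITIES if m in embeddings]
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product score of each modality embedding against the query.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[m])) / scale
        for m in names
    ]
    weights = softmax(scores)
    # Weighted sum of the embeddings, one output dimension at a time.
    fused = [
        sum(w * embeddings[m][i] for w, m in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy inputs, invented for illustration only.
embeddings = {
    "spatial": [0.5, 0.1, 0.0, 0.2],
    "temporal": [0.2, 0.4, 0.1, 0.0],
    "frequency": [1.5, 0.0, 0.3, 0.1],
    "audio": [0.1, 0.2, 0.2, 0.3],
}
query = [1.0, 0.0, 0.0, 0.0]

fused, weights = attention_fuse(embeddings, query)
for name in MODALITIES:
    print(f"{name}: {weights[name]:.3f}")
```

Because the weights are non-negative and sum to one, they can be read as a rough indication of how much each modality contributed to the fused representation. Whether the framework's modality-level confidence reporting is derived this way is not stated in the abstract.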
Singh et al. (Thu,) studied this question.