• Presents VIDS-Guard, a forensics-aware multi-stream transformer model.
• Integrates SRM, YCbCr, FFT, and temporal attention for video-level detection.
• Achieves AUC = 0.97 and Macro-F1 = 0.90 on unseen deepfake datasets.
• Demonstrates strong cross-dataset generalization under domain shift.
• Offers an efficient 25.9M-parameter design for real-world deployment.

The proliferation of highly realistic deepfake videos poses a growing threat to digital trust, underscoring the need for detectors that remain reliable across diverse manipulation types and capture conditions. This paper introduces VIDS-Guard (Video Integrity Deepfake Shield), a novel forensics-aware multi-stream transformer framework that integrates spatial, frequency, and temporal cues within a unified architecture. Unlike conventional convolutional or transformer-based detectors that rely primarily on semantic consistency, VIDS-Guard embeds forensic inductive biases through Spatial Rich Model (SRM) residual filtering, YCbCr color-space decomposition, and Fast Fourier Transform (FFT) spectral embeddings to expose subtle manipulation artifacts. A temporal transformer encoder with attention pooling further models cross-frame inconsistencies, enabling robust video-level predictions. Extensive experiments conducted with six benchmark models (Xception, ResNet50, MobileNetV3-Large, SlowFast, ViViT, and TimeSformer) demonstrate that VIDS-Guard achieves superior generalization and balanced detection performance across validation, test, and unseen datasets, attaining the highest accuracy and Macro-F1 scores under domain shift. These findings establish VIDS-Guard as a state-of-the-art forensic framework for trustworthy multimedia authentication and emphasize the importance of incorporating forensic priors to ensure sustainable robustness in deepfake video detection.
Alanazi et al. studied this question.
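The forensic cues named in the abstract (YCbCr decomposition, SRM residual filtering, spectral embeddings, and attention pooling over frames) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The 3x3 kernel is one commonly used SRM-style high-pass filter, the color conversion uses the BT.601 full-range (JPEG) coefficients, a naive DFT stands in for the FFT, and the per-frame attention scores are passed in by hand where the model would learn them. All function names are hypothetical.

```python
import cmath
import math

# Assumed SRM-style second-order high-pass kernel; the paper's exact filters are not given.
SRM_KERNEL = [[-1, 2, -1],
              [2, -4, 2],
              [-1, 2, -1]]


def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (BT.601 full-range, as used in JPEG)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def srm_residual(img):
    """Valid-mode correlation of a grayscale image (list of rows) with SRM_KERNEL.

    The kernel sums to zero, so smooth content is suppressed and only
    high-frequency residuals remain. The output is scaled by 1/4.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            acc = sum(SRM_KERNEL[di][dj] * img[i + di - 1][j + dj - 1]
                      for di in range(3) for dj in range(3))
            row.append(acc / 4.0)
        out.append(row)
    return out


def dft2_log_magnitude(img):
    """Log-magnitude 2D spectrum via a naive DFT (slow; a stand-in for an FFT)."""
    h, w = len(img), len(img[0])
    spec = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * math.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(1j * angle)
            row.append(math.log1p(abs(acc)))
        spec.append(row)
    return spec


def attention_pool(frame_feats, frame_scores):
    """Softmax-weighted average of per-frame feature vectors.

    frame_scores are supplied directly here; in a trained model they would
    come from a learned scoring layer.
    """
    m = max(frame_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in frame_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frame_feats[0])
    pooled = [sum(wt * f[d] for wt, f in zip(weights, frame_feats))
              for d in range(dim)]
    return pooled, weights
```

In a full pipeline each frame would pass through the three per-frame transforms, the resulting maps would be embedded as tokens, and `attention_pool` would collapse the per-frame embeddings into a single video-level vector for classification. How VIDS-Guard fuses the streams is not specified in this section, so that step is left out.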