What question did this study set out to answer?

The aim is to develop a robust video-level detector for deepfake videos using advanced forensic techniques.

April 12, 2026 · Open Access

VIDS-Guard: A Novel Forensics-Aware Multi-Stream Transformer Framework for Robust Deepfake Video Detection

Key Points

Aimed to develop a robust video-level deepfake detector built on forensic cues.
Developed a multi-stream transformer model integrating SRM residual filtering, YCbCr color-space decomposition, FFT spectral embeddings, and temporal attention.
Compared the model against six benchmark models: Xception, ResNet50, MobileNetV3-Large, SlowFast, ViViT, and TimeSformer.
Assessed performance on validation, test, and unseen deepfake datasets.
Achieved an AUC of 0.97 and a Macro-F1 of 0.90 on unseen deepfake datasets.
Demonstrated strong cross-dataset generalization under domain shift.
Uses an efficient design with 25.9 million parameters, intended for real-world deployment.
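
For readers unfamiliar with the Macro-F1 metric cited above, the following is a minimal sketch of how it is commonly computed: the F1 score is calculated separately for each class (here assumed to be 0 = real, 1 = fake) and then averaged without weighting, so the minority class counts as much as the majority class. The function name and label encoding are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores.

    Assumed label encoding (not from the paper): 0 = real, 1 = fake.
    """
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example: three fake videos and one real video, one fake missed.
score = macro_f1([1, 1, 1, 0], [1, 1, 0, 0])
```

In this toy case the fake class has F1 = 0.8 and the real class has F1 = 2/3, so the macro average is about 0.733 even though plain accuracy is 0.75.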

Abstract

• Presents VIDS-Guard, a forensics-aware multi-stream transformer model.
• Integrates SRM, YCbCr, FFT, and temporal attention for video-level detection.
• Achieves AUC = 0.97 and Macro-F1 = 0.90 on unseen deepfake datasets.
• Demonstrates strong cross-dataset generalization under domain shift.
• Offers an efficient 25.9 M-parameter design for real-world deployment.

The proliferation of highly realistic deepfake videos poses a growing threat to digital trust, underscoring the need for detectors that remain reliable across diverse manipulation types and capture conditions. This paper introduces VIDS-Guard (Video Integrity Deepfake Shield), a novel forensics-aware multi-stream transformer framework that integrates spatial, frequency, and temporal cues within a unified architecture. Unlike conventional convolutional or transformer-based detectors that rely primarily on semantic consistency, VIDS-Guard embeds forensic inductive biases through Spatial Rich Model (SRM) residual filtering, YCbCr color-space decomposition, and Fast Fourier Transform (FFT) spectral embeddings to expose subtle manipulation artifacts. A temporal transformer encoder with attention pooling further models cross-frame inconsistencies, enabling robust video-level predictions. Extensive experiments conducted with six benchmark models (Xception, ResNet50, MobileNetV3-Large, SlowFast, ViViT, and TimeSformer) demonstrate that VIDS-Guard achieves superior generalization and balanced detection performance across validation, test, and unseen datasets, attaining the highest accuracy and Macro-F1 scores under domain shift. These findings establish VIDS-Guard as a state-of-the-art forensic framework for trustworthy multimedia authentication and emphasize the importance of incorporating forensic priors to ensure sustainable robustness in deepfake video detection.
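
To make the forensic cues named in the abstract more concrete, here is a minimal NumPy sketch of the four ingredients: an SRM-style high-pass residual, a YCbCr conversion, an FFT log-magnitude spectrum, and attention pooling over per-frame features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the SRM filter bank, the color-conversion constants, or the pooling parameterization, so the 5x5 kernel, the full-range BT.601 coefficients, and the single scoring vector `w` below are all stand-ins chosen for clarity.

```python
import numpy as np

# One commonly used 5x5 SRM high-pass kernel. This is an assumption:
# the filter bank actually used by VIDS-Guard is not given in the abstract.
SRM_KERNEL = np.array([
    [-1,  2,  -2,  2, -1],
    [ 2, -6,   8, -6,  2],
    [-2,  8, -12,  8, -2],
    [ 2, -6,   8, -6,  2],
    [-1,  2,  -2,  2, -1],
], dtype=np.float64) / 12.0


def rgb_to_ycbcr(frame):
    """Convert an RGB frame (H, W, 3) in [0, 255] to full-range BT.601 YCbCr."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)


def srm_residual(gray):
    """High-pass filter a 2-D image (zero-padded, same output size).

    The kernel sums to zero, so smooth content is suppressed and
    fine-grained noise residuals are kept.
    """
    k = SRM_KERNEL.shape[0]
    padded = np.pad(gray, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return np.einsum("hwij,ij->hw", windows, SRM_KERNEL)


def fft_log_magnitude(gray):
    """Centered log-magnitude spectrum of a 2-D image."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum))


def attention_pool(frame_features, w):
    """Pool per-frame features (T, D) into one video vector (D,).

    Scores are frame_features @ w, where w is a hypothetical learned
    vector; a softmax over frames gives the pooling weights.
    """
    scores = frame_features @ w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ frame_features, weights


# Usage on random data standing in for a short clip of 8 frames.
rng = np.random.default_rng(0)
clip = rng.uniform(0, 255, size=(8, 32, 32, 3))
luma = rgb_to_ycbcr(clip[0])[..., 0]
residual = srm_residual(luma)
spectrum = fft_log_magnitude(luma)
features = rng.normal(size=(8, 16))  # placeholder per-frame embeddings
video_vec, frame_weights = attention_pool(features, rng.normal(size=16))
```

In a full detector each stream would feed a learned encoder before temporal modeling; the sketch only shows the fixed, hand-designed transforms that give such a model its forensic inductive bias.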
