What does this research mean for the field?

A spatiotemporal artifact-aware framework integrating CNNs, Transformers, and frequency-domain filters simultaneously detects Deepfake videos and attributes their specific forgery algorithms, outperforming most state-of-the-art approaches on the FaceForensics++ dataset. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to develop a framework that detects deepfake videos while attributing the algorithms used for forgery.

June 10, 2026 · Open Access

A spatiotemporal defect-integrated deepfake video detection and forgery algorithm attribution model

Key Points

Proposed a spatiotemporal artifact-aware framework for detection and attribution.
Utilized a combination of Convolutional Neural Networks (CNNs) and Transformer models for feature extraction.
Employed a multi-loss optimization strategy incorporating cross-entropy, triplet loss, and hard sample mining.
Achieved 97.86±0.18% detection accuracy on the FaceForensics++ dataset.
Reached 99.81±0.11% Area Under the Curve (AUC) for Deepfake detection.
Attained 98.42±0.15% accuracy in attributing forgery algorithms.
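
The multi-loss strategy listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the summary does not define the hard sample mining loss, so this sketch substitutes batch-hard mining inside the triplet term (each anchor is paired with its farthest positive and closest negative). The function names, the margin of 0.3, and the weight `triplet_weight` are assumptions made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet(embeddings, labels, margin=0.3):
    """Triplet loss with batch-hard mining (an assumed stand-in for the
    paper's hard sample mining): hardest positive and hardest negative per anchor."""
    losses = []
    for i, (anchor, y) in enumerate(zip(embeddings, labels)):
        pos = [euclidean(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if l == y and j != i]
        neg = [euclidean(anchor, e)
               for e, l in zip(embeddings, labels) if l != y]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def joint_loss(logits_batch, embeddings, labels, triplet_weight=1.0, margin=0.3):
    """Mean cross-entropy plus a weighted batch-hard triplet term."""
    ce = sum(cross_entropy(z, y) for z, y in zip(logits_batch, labels)) / len(labels)
    return ce + triplet_weight * batch_hard_triplet(embeddings, labels, margin)
```

The cross-entropy term drives classification (real versus fake, or which forgery algorithm), while the triplet term pulls same-class embeddings together and pushes different-class embeddings apart, which is the compactness and separability effect the paper attributes to its multi-loss training.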

Abstract

As a prominent form of AI-generated content, Deepfakes have raised substantial safety concerns because they make fraud harder to spot and more likely to succeed in real-world scenarios. Most existing Deepfake research focuses on detection and fails to fully capture the subtle manipulation traces that each forgery algorithm leaves during synthesis. Attributing a manipulated video to the specific algorithm that generated it is also crucial, since it helps determine the type of forgery and limit the spread of misinformation. To fill this gap, this paper proposes a spatiotemporal artifact-aware framework that simultaneously performs two core tasks: Deepfake video detection and forgery algorithm attribution. To model the spatiotemporal information of a tampered video comprehensively, the framework combines the local feature learning of Convolutional Neural Networks (CNNs) with the long-range dependency modeling of the Transformer, mining forgery traces from both the local and global information of the input image sequence. To help the model capture robust forgery features, a frequency-domain filter is integrated into the convolutional features, amplifying the subtle traces left by synthesis algorithms. Because forgery traces are multi-scale, both the middle-layer and deep-layer outputs of the backbone network are used to separately expose temporal defects at different feature levels, and the final prediction for an input face sequence is obtained by fusing the predictions from these two components. The framework is trained under the joint supervision of cross-entropy loss, triplet loss, and hard sample mining loss. This multi-loss optimization strategy adjusts intra-class compactness and inter-class separability, enabling the model to learn more discriminative features for both detection and attribution. Comprehensive experiments on the FaceForensics++ dataset show that the proposed method achieves 97.86±0.18% detection accuracy and 99.81±0.11% AUC, as well as 98.42±0.15% accuracy for forgery algorithm attribution, outperforming most state-of-the-art approaches on this dataset.
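
To make the frequency-domain filtering idea concrete, here is a small sketch of a high-pass filter applied to a single 2D feature map. The abstract does not say which filter the authors use, so the high-pass choice, the `cutoff` parameter, and the function names are all assumptions. The naive DFT is written out only to keep the example dependency-free; a real pipeline would more plausibly use an FFT routine such as `numpy.fft.fft2`.

```python
import cmath


def dft2(grid, inverse=False):
    """Naive 2D discrete Fourier transform (O(N^4); for tiny illustrative maps only)."""
    rows, cols = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = sign * 2 * cmath.pi * (u * r / rows + v * c / cols)
                    total += grid[r][c] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(rows*cols) normalization.
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def high_pass(feature_map, cutoff=1):
    """Zero out low-frequency coefficients (assumed filter type), then invert.

    A coefficient is dropped when its wrapped frequency index is below
    `cutoff` on both axes; cutoff=1 removes only the DC (mean) component.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    spectrum = dft2(feature_map)
    for u in range(rows):
        for v in range(cols):
            fu, fv = min(u, rows - u), min(v, cols - v)
            if fu < cutoff and fv < cutoff:
                spectrum[u][v] = 0j
    filtered = dft2(spectrum, inverse=True)
    # The mask is symmetric, so the result is real up to rounding error.
    return [[value.real for value in row] for row in filtered]
```

With `cutoff=1`, a constant map is suppressed to approximately zero, while a fine checkerboard pattern added on top of a constant passes through with the constant removed. This mirrors the intuition that smooth image content is attenuated and fine-grained synthesis traces are emphasized.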
