April 11, 2026 | Open Access

Multi-Scale Dynamic Perception and Context Guidance Modulation for Efficient Deepfake Detection

Key Points

Aimed to develop an efficient deepfake detection method that balances accuracy and computational cost.
Proposed a four-stage lightweight architecture that extracts features hierarchically, from local textures to global semantics.
Implemented a multi-scale dynamic perception mechanism that splits feature channels into parallel groups, each with a lightweight attention module.
Introduced Context-Guided Dynamic Convolution, which uses mid-level spatial anomalies to modulate high-level semantic filters.
Evaluated model performance on FaceForensics++ and DeepFake Detection Challenge datasets.
Achieved an AUC of 91.98% on FaceForensics++ and 93.50% on DeepFake Detection Challenge.
Outperformed current state-of-the-art lightweight detection methods in AUC on both datasets.
Required only 3.06 M parameters and 1.36 G FLOPs, a better performance-efficiency trade-off than heavy Vision Transformers.
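
For readers unfamiliar with the metric reported above, AUC can be read as the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the labels and scores are illustrative toy values of our own, not data or code from the study, and the study's exact evaluation protocol is not described on this page.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for fake (positive), 0 for real (negative).
    scores: higher means "more likely fake".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores: three fakes, three reals.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 4))  # 8 of 9 fake/real pairs are ranked correctly -> 0.8889
```

The quadratic loop is chosen for clarity; on large test sets a rank-based formulation gives the same value more cheaply.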

Abstract

Deepfake technology poses significant threats to information authenticity and social trust, necessitating effective detection methods. However, existing detection approaches predominantly rely on high-complexity network architectures that, while accurate in controlled environments, incur prohibitive computational costs that hinder deployment in resource-constrained scenarios such as social media platforms. To address this efficiency-accuracy dilemma, we propose a lightweight face forgery detection method that systematically learns multi-scale forgery traces. Our approach features a four-stage lightweight architecture that hierarchically extracts features from local textures to global semantics, mimicking the human visual system. Within each stage, a multi-scale dynamic perception mechanism divides feature channels into parallel groups equipped with lightweight attention modules to capture forgery cues spanning pixel-level anomalies, local structures, regional patterns, and semantic inconsistencies. Furthermore, rather than relying on conventional feature fusion that risks suppressing subtle artifacts, we introduce a novel Context-Guided Dynamic Convolution. This mechanism uses mid-level spatial anomalies as active anchors to dynamically modulate high-level semantic filters, with the goal of mitigating the disconnect between semantic content and forgery evidence. Our model achieves strong performance, yielding an AUC of 91.98% on FaceForensics++ and 93.50% on DeepFake Detection Challenge, outperforming current state-of-the-art lightweight methods. Compared with heavy Vision Transformers, our model also achieves a superior performance-efficiency trade-off, requiring only 3.06 M parameters and 1.36 G FLOPs, making it highly suitable for real-time, resource-constrained deployment.
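
The idea of letting mid-level features modulate high-level filters can be illustrated with a small sketch. The abstract does not specify how Context-Guided Dynamic Convolution is implemented, so the code below is only one plausible reading, borrowing the general dynamic-convolution pattern of mixing several candidate kernels with input-dependent weights. Every name here (`context_guided_pointwise_conv`, `router`, `kernels`) and every number is hypothetical, and a 1x1 (pointwise) convolution on plain Python lists stands in for a real layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average(feature_map):
    """Average each channel over all positions.

    feature_map: list of positions, each a list of channel values.
    """
    n = len(feature_map)
    channels = len(feature_map[0])
    return [sum(pos[c] for pos in feature_map) / n for c in range(channels)]


def mix_kernels(kernels, weights):
    """Weighted sum of K candidate kernels, each of shape C_out x C_in."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(w * k[r][c] for w, k in zip(weights, kernels)) for c in range(cols)]
        for r in range(rows)
    ]


def context_guided_pointwise_conv(high, mid, kernels, router):
    """Assumed mechanism: mid-level context picks the filter for high-level features.

    high:    high-level features, list of positions x C_in
    mid:     mid-level features, list of positions x C_mid
    kernels: K candidate 1x1 kernels, each C_out x C_in
    router:  K x C_mid matrix mapping pooled context to kernel logits
    """
    context = global_average(mid)
    logits = [sum(r * c for r, c in zip(row, context)) for row in router]
    attn = softmax(logits)
    kernel = mix_kernels(kernels, attn)
    out = [[sum(w * x for w, x in zip(row, pos)) for row in kernel] for pos in high]
    return out, attn


# Toy inputs (hypothetical values): 2 positions, 2 channels, 2 candidate kernels.
high = [[1.0, 2.0], [0.5, -1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
router = [[1.0, -1.0], [-1.0, 1.0]]

# Zero context gives equal routing weights, so the mixed kernel is 1.5 * identity.
out_flat, attn_flat = context_guided_pointwise_conv(
    high, [[0.0, 0.0], [0.0, 0.0]], kernels, router
)
# A strong response in mid-level channel 0 routes almost entirely to kernel 0.
out_anom, attn_anom = context_guided_pointwise_conv(
    high, [[4.0, 0.0], [4.0, 0.0]], kernels, router
)
print(attn_flat, out_flat)
print(attn_anom)
```

The point of the sketch is that the same high-level input yields different outputs depending on the mid-level context, which is the behaviour the abstract attributes to the mechanism. In a trained network the router and candidate kernels would be learned, and the context would likely be spatial rather than globally pooled.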
