Deepfake technology poses significant threats to information authenticity and social trust, necessitating effective detection methods. However, existing detection approaches predominantly rely on high-complexity network architectures that, while accurate in controlled environments, incur prohibitive computational costs that hinder deployment in resource-constrained scenarios such as social media platforms. To address this efficiency-accuracy dilemma, we propose a lightweight face forgery detection method that systematically learns multi-scale forgery traces. Our approach features a four-stage lightweight architecture that hierarchically extracts features from local textures to global semantics, mimicking the human visual system. Within each stage, a multi-scale dynamic perception mechanism divides feature channels into parallel groups equipped with lightweight attention modules to capture forgery cues spanning pixel-level anomalies, local structures, regional patterns, and semantic inconsistencies. Rather than relying on conventional feature fusion, which risks suppressing subtle artifacts, we introduce a novel Context-Guided Dynamic Convolution. This mechanism uses mid-level spatial anomalies as active anchors to dynamically modulate high-level semantic filters, with the goal of mitigating the disconnect between semantic content and forgery evidence. Our model achieves an AUC of 91.98% on FaceForensics++ and 93.50% on the DeepFake Detection Challenge, outperforming current state-of-the-art lightweight methods. Compared to heavy Vision Transformers, it also offers a superior performance-efficiency trade-off, requiring only 3.06M parameters and 1.36 GFLOPs, which makes it well suited to real-time, resource-constrained deployment.
Ding et al. studied this question.
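The channel-grouping idea behind the multi-scale dynamic perception mechanism can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not specify the number of groups, the kernel sizes, or the form of the attention module. The sketch assumes four groups with receptive fields of 1, 3, 5, and 7, replaces learned convolutions with a fixed mean filter over 1D signals, and uses a simple sigmoid gate as a stand-in for lightweight attention. All function names are hypothetical.

```python
import math

def split_channels(channels, num_groups):
    """Split a list of channels into contiguous, equal-sized groups."""
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(channels) // num_groups
    return [channels[i * size:(i + 1) * size] for i in range(num_groups)]

def box_filter(signal, kernel_size):
    """Zero-padded mean filter; a fixed stand-in for a learned convolution."""
    half = kernel_size // 2
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[j] for j in range(i - half, i + half + 1) if 0 <= j < n]
        out.append(sum(window) / kernel_size)
    return out

def channel_gate(channel):
    """Assumed attention stand-in: sigmoid of the channel's mean activation."""
    mean = sum(channel) / len(channel)
    return 1.0 / (1.0 + math.exp(-mean))

def multi_scale_block(channels, kernel_sizes=(1, 3, 5, 7)):
    """Process each channel group at its own scale, then concatenate.

    kernel_sizes is an assumption chosen for illustration; one group per size.
    """
    groups = split_channels(channels, len(kernel_sizes))
    out = []
    for group, k in zip(groups, kernel_sizes):
        for ch in group:
            filtered = box_filter(ch, k)          # scale-specific response
            gate = channel_gate(filtered)         # per-channel reweighting
            out.append([gate * v for v in filtered])
    return out

# Toy input: 8 channels, each a 1D signal of length 6.
features = [[float(i + j) for j in range(6)] for i in range(8)]
result = multi_scale_block(features)
```

Because each group touches only a fraction of the channels, the per-stage cost stays low while the concatenated output still mixes small and large receptive fields. In a real network the groups would operate on 2D feature maps with learned weights.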