Abstract
Increasingly sophisticated deepfake technologies challenge the authenticity of digital media and underscore the need for multimodal detection methods. This review synthesizes recent deep learning approaches for identifying audio-visual forgeries, with emphasis on fusion strategies that combine visual and auditory signals to detect complex manipulations. We evaluate key public datasets and benchmarks and highlight the efficacy of these approaches in applications including social media content moderation, judicial forensics, and fraud prevention. Despite notable advances, limited cross-domain generalization and high computational cost still hinder practical deployment. Future work should focus on lightweight, scalable architectures and standardized evaluation protocols that improve detection robustness across diverse real-world scenarios and help safeguard the integrity of digital content.
Tan et al. (Mon,) studied this question.