The rapid advancement of deepfake technology poses severe threats to public security and information authenticity, and traditional single-modal detection methods have reached a bottleneck because they are easy to circumvent. This paper systematically reviews deepfake face detection techniques based on multimodal biometric cross-verification, covering their theoretical foundations, technical approaches, datasets, and open challenges. On the theoretical side, these techniques integrate visual features (facial micro-expressions, corneal specular highlights), auditory features (speech spectra, lip-sync consistency), and physiological signals (heart rate rhythms, facial blood flow), and they exploit the complementarity of the modalities and consistency checks between them to capture cross-modal forgery traces. On the technical side, the paper summarizes feature extraction methods such as CNN-based texture analysis, spectrogram modeling, and near-infrared imaging, and compares three fusion strategies: early fusion, late weighted voting fusion, and attention-guided dynamic fusion. Among these, attention mechanisms significantly enhance sensitivity to complex cues. The paper also catalogs multimodal datasets (e.g., IAV-DF, DECRO) and evaluation metrics (accuracy, F1-score) that serve as standardized benchmarks. Although multimodal detection has improved robustness, it still faces challenges: high-fidelity forgeries can undermine the modal consistency it relies on, and adaptability to complex scenarios remains inadequate. Future research should focus on fine-grained biometric mining, lightweight model deployment, interpretability enhancement, and the improvement of regulations and technical standards to curb misuse and promote legitimate applications in digital security.
Haozhe Wu (Wed,) studied this question.
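To make the contrast between the fusion strategies named in the abstract concrete, the following is a minimal sketch, not the method of any surveyed paper. It assumes each modality has already produced a probability that the clip is fake. The function names, scores, weights, and the `temperature` parameter are all hypothetical. The second function is a hand-written confidence-based reweighting that only stands in for attention-guided dynamic fusion; a real attention module would learn its weights from data.

```python
import math


def late_weighted_fusion(scores, weights):
    """Late fusion: combine per-modality fake probabilities with fixed weights."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same modalities")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * weights[m] for m in scores) / total


def confidence_weighted_fusion(scores, temperature=0.25):
    """Toy stand-in for dynamic fusion (assumption, not learned attention).

    Each modality is weighted by a softmax over its confidence, measured
    as the distance of its probability from the undecided value 0.5.
    """
    logits = {m: abs(p - 0.5) / temperature for m, p in scores.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {m: math.exp(v - peak) for m, v in logits.items()}
    norm = sum(exps.values())
    return sum(scores[m] * exps[m] / norm for m in scores)


# Hypothetical per-modality outputs for one video clip.
scores = {"visual": 0.9, "audio": 0.6, "physiological": 0.3}
weights = {"visual": 0.5, "audio": 0.3, "physiological": 0.2}

fused_static = late_weighted_fusion(scores, weights)
fused_dynamic = confidence_weighted_fusion(scores)
print(f"static: {fused_static:.3f}, dynamic: {fused_dynamic:.3f}")
```

The fixed-weight variant treats every clip the same way, whereas the confidence-weighted variant shifts weight toward whichever modality is most decisive for the clip at hand, which is the intuition behind input-dependent fusion.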