Deepfakes pose a major threat to the integrity of digital media. We propose DeiTFake, a DeiT-based transformer model trained with a two-stage progressive strategy of increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations, followed by a fine-tuning phase that uses advanced affine and color-based augmentations. We initialize from DeiT's pre-trained weights, which provide a strong starting point for learning manipulation artifacts and increase the robustness of the detection model. Trained on a face-cropped dataset derived from the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 99.97% after stage two, a strong result under the same face-level evaluation setting. We analyze the impact of augmentation and training schedules, and provide practical benchmarks for facial deepfake detection.

• A two-stage training approach with progressive augmentation is proposed for the deepfake detection model.
• Facebook's DeiT vision transformers are used for superior detection compared to existing models.
• Standard training followed by fine-tuning with affine augmentations reached 99.22% accuracy and 99.97% AUROC.
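To make the idea of a two-stage schedule with increasing augmentation complexity concrete, the following is a minimal sketch using only the Python standard library. It does not reproduce the authors' pipeline: the names `StageConfig`, `sample_augmentation`, and `training_plan`, and every numeric value (epoch counts, learning rates, augmentation ranges), are hypothetical and chosen purely for illustration. The sketch only samples augmentation parameters per epoch; in practice these would be passed to an image library's flip, affine, and color-jitter transforms.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Settings for one training stage (all values are illustrative assumptions)."""
    name: str
    epochs: int
    learning_rate: float
    max_rotation_deg: float    # rotation sampled in [-max, +max]
    max_translate_frac: float  # shift as a fraction of image size
    scale_range: tuple         # (min, max) zoom factor
    max_shear_deg: float       # shear sampled in [-max, +max]
    color_jitter: float        # brightness/contrast factor in [1 - c, 1 + c]


# Stage one: transfer learning with standard (mild) augmentations.
STAGE_ONE = StageConfig("transfer_learning", 5, 1e-4, 10.0, 0.0, (1.0, 1.0), 0.0, 0.0)
# Stage two: fine-tuning with stronger affine and color-based augmentations.
STAGE_TWO = StageConfig("fine_tuning", 3, 1e-5, 15.0, 0.1, (0.9, 1.1), 10.0, 0.2)


def sample_augmentation(cfg, rng):
    """Draw one set of augmentation parameters within the stage's ranges."""
    t, s, c = cfg.max_translate_frac, cfg.max_shear_deg, cfg.color_jitter
    return {
        "hflip": rng.random() < 0.5,
        "rotation_deg": rng.uniform(-cfg.max_rotation_deg, cfg.max_rotation_deg),
        "translate_frac": (rng.uniform(-t, t), rng.uniform(-t, t)),
        "scale": rng.uniform(*cfg.scale_range),
        "shear_deg": rng.uniform(-s, s),
        "brightness": 1.0 + rng.uniform(-c, c),
        "contrast": 1.0 + rng.uniform(-c, c),
    }


def training_plan(stages, seed=0):
    """List (stage, epoch, learning rate, sampled augmentation) in training order."""
    rng = random.Random(seed)
    plan = []
    for cfg in stages:
        for epoch in range(cfg.epochs):
            plan.append((cfg.name, epoch, cfg.learning_rate, sample_augmentation(cfg, rng)))
    return plan


if __name__ == "__main__":
    for stage, epoch, lr, aug in training_plan([STAGE_ONE, STAGE_TWO]):
        print(stage, epoch, lr, aug)
```

The design point the sketch illustrates is that the second stage widens the sampling ranges (translation, scale, shear, color) and lowers the learning rate, so the model first adapts the pre-trained weights on mildly perturbed faces and is then fine-tuned on harder variations.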
Kumar et al. (Sun,) studied this question.