Background: Deepfakes pose a growing threat to the integrity of visual media, motivating detectors that remain reliable as forgeries become increasingly realistic. Methods: We propose a deepfake detection framework built on CLIP-derived SigLIP-2 vision transformers and a multi-task design that jointly performs (i) classification and (ii) manipulated-region localization when pixel-level supervision is available. We evaluated the approach on three public benchmarks of increasing complexity (HiDF, SIDSet (SIDA), and CiFake), using each dataset’s official partitions where provided (SIDSet uses the predefined train/validation split) and a standardized preprocessing and training pipeline across experiments. Results: On HiDF, our model achieved strong performance on both video and image tracks (AUC up to 0.931 on video and 0.968 on images), yielding large gains relative to previously reported HiDF baselines under their published settings. On SIDSet, the model achieved 99.1% three-class accuracy (real/synthetic/tampered) and produced accurate localization masks for many tampered regions, while we explicitly documented the split protocol and leakage checks to support the validity of the evaluation. On CiFake, the model exceeded 95% accuracy and attained an AUC of 0.986. Conclusions: Overall, the results indicate that SigLIP-2 representations combined with multi-task training can deliver high detection accuracy and interpretable localization on challenging, realistic forgeries, while highlighting the importance of clearly stated evaluation protocols for fair comparison.
Soundararajan et al. studied this question.