What question did this study set out to answer?

The research aims to enhance deepfake detection through a framework utilizing CLIP-based SigLIP-2 vision transformers and multi-task learning.

March 21, 2026 | Open Access

Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers

Key points

Developed a deepfake detection framework using CLIP-derived SigLIP-2 vision transformers.
Employed a multi-task design for classification and manipulated-region localization.
Evaluated on three public benchmarks: HiDF, SID_Set, and CiFake.
Achieved AUC up to 0.931 on HiDF video and 0.968 on images.
Obtained 99.1% accuracy on SID_Set for real, synthetic, and tampered categories.
Exceeded 95% accuracy and reached an AUC of 0.986 on CiFake.
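
Several of the figures above are AUC values. As a reminder of what that metric measures, here is a minimal sketch in plain Python, assuming binary labels where 1 marks a fake sample and higher scores mean "more likely fake" (both conventions are assumptions for illustration, and `roc_auc` is a hypothetical helper, not the authors' evaluation code). AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 ground-truth values (1 = positive, assumed "fake").
    scores: iterable of detector scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: three of the four positive/negative pairs are ranked correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The double loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value for larger evaluation sets.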

Abstract

Background: Deepfakes pose a growing threat to the integrity of visual media, motivating detectors that remain reliable as forgeries become increasingly realistic.

Methods: We propose a deepfake detection framework built on CLIP-derived SigLIP-2 vision transformers and a multi-task design that jointly performs (i) classification and (ii) manipulated-region localization when pixel-level supervision is available. We evaluated the approach on three public benchmarks of increasing complexity: HiDF, SIDSet (SIDA), and CiFake. We used each dataset’s official partitions where provided (SIDSet uses the predefined train/validation split) and a standardized preprocessing and training pipeline across experiments.

Results: On HiDF, our model achieved strong performance on both video and image tracks (AUC up to 0.931 on video and 0.968 on images), yielding large gains relative to previously reported HiDF baselines under their published settings. On SIDSet, the model achieved 99.1% three-class accuracy (real/synthetic/tampered) and produced accurate localization masks for many tampered regions, while we explicitly documented the split protocol and leakage checks to support the validity of the evaluation. On CiFake, the model exceeded 95% accuracy and attained an AUC of 0.986.

Conclusions: Overall, the results indicate that SigLIP-2 representations combined with multi-task training can deliver high detection accuracy and interpretable localization on challenging, realistic forgeries, while highlighting the importance of clearly stated evaluation protocols for fair comparison.
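
The abstract describes joint training on classification and localization, with the localization term used only when pixel-level supervision is available. One common way to express such an objective is sketched below in plain Python. This is an illustrative assumption, not the authors' loss: the function names, the choice of cross-entropy plus per-pixel binary cross-entropy, and the `loc_weight` parameter are all hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample from raw class logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mask_bce(mask_logits, mask_target):
    """Mean binary cross-entropy over a flattened mask of per-pixel logits."""
    total = 0.0
    for z, t in zip(mask_logits, mask_target):
        # Stable BCE-with-logits: max(z, 0) - z*t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(mask_logits)


def multitask_loss(class_logits, label, mask_logits, mask_target=None, loc_weight=1.0):
    """Classification loss, plus a localization term only when a mask is given.

    mask_target=None stands for a sample without pixel-level supervision,
    in which case the localization head contributes nothing to the loss.
    loc_weight is a hypothetical balancing factor between the two tasks.
    """
    loss = softmax_cross_entropy(class_logits, label)
    if mask_target is not None:
        loss += loc_weight * mask_bce(mask_logits, mask_target)
    return loss


if __name__ == "__main__":
    # Three classes (assumed order: real, synthetic, tampered), four-pixel mask.
    logits = [0.2, 0.1, 1.5]
    pixels = [2.0, -1.0, 0.5, -3.0]
    print(multitask_loss(logits, 2, pixels))                # no mask available
    print(multitask_loss(logits, 2, pixels, [1, 0, 1, 0]))  # mask supervised
```

Skipping the localization term for unlabeled samples lets datasets with and without masks share one training loop, which matches the "when pixel-level supervision is available" condition in the Methods.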
