April 12, 2026 | Open Access

PhishViT: A Vision Transformer-Based Framework for Real-Time Phishing Detection from Webpage Screenshots

Key points

Aimed to develop a framework for real-time phishing detection that works from webpage screenshots rather than URL strings or HTML source code.
Developed PhishViT using a fine-tuned DeiT-Small architecture for visual analysis.
Developed iteratively in three phases (V1, V2, V3), growing from 253 screenshots in V1 to 642 in V2.
Evaluated V3 with 5-fold cross-validation and against baseline models (ResNet50, EfficientNet-B0, ViT-Base).
Achieved 91.75% accuracy in V3 with DeiT-Small.
Reached an F1-score of 91.49% and an AUC-ROC of 0.9928.
Lost at most 2.06% accuracy across six visual perturbation conditions.
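
The metrics listed above (accuracy, F1-score, and a cross-validation result reported as mean ± deviation) can be illustrated with a minimal sketch. The function names and toy labels below are hypothetical and are not taken from the paper. The use of the sample standard deviation for the ± term is an assumption, since the source does not say which deviation it reports.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores.

    Assumption: the +/- term is a sample standard deviation.
    """
    return mean(scores), stdev(scores)


# Toy labels for illustration only (1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

In a 5-fold setting, `summarize_folds` would receive the five per-fold accuracies and return the two numbers that make up a "mean ± deviation" figure.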

Abstract

This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection from webpage screenshots. Unlike conventional methods that analyze URL strings or HTML source code, PhishViT operates on the visual rendering layer using a fine-tuned DeiT-Small architecture. The framework is developed through three iterative phases: V1 (253 screenshots, 78.95% accuracy), V2 (642 screenshots, 96.91% accuracy), and V3 (top-tier evaluation with baselines, 5-fold cross-validation: 85.23% ± 1.18%, and robustness testing). V3 achieves 91.75% accuracy, a 91.49% F1-score, an AUC-ROC of 0.9928, and 5.44 ms inference latency using DeiT-Small. A comprehensive baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base demonstrates DeiT-Small's superior efficiency-accuracy trade-off. Robustness evaluation under six visual perturbation conditions confirms a maximum accuracy drop of 2.06%. Attention rollout visualization provides interpretable detection evidence.
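
The abstract does not say which variant of attention rollout is used. As a rough illustration, the sketch below follows the commonly described formulation: average the attention heads in each layer, mix in the identity matrix to account for residual connections, renormalize the rows, and multiply the per-layer matrices together. The function names and the tiny two-token input are hypothetical. In a real ViT the matrices would come from the model's attention layers, which this sketch does not attempt to extract.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Combine per-layer attention into one token-to-token relevance matrix.

    layers: list of layers (first layer first); each layer is a list of
    heads; each head is an n x n attention matrix whose rows sum to 1.
    """
    n = len(layers[0][0])
    # Start from the identity: before any layer, each token attends to itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:
        # Average attention over all heads of this layer.
        avg = [
            [sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
            for i in range(n)
        ]
        # Mix in the identity to model the residual connection.
        aug = [
            [0.5 * avg[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        # Renormalize each row so it remains a distribution over tokens.
        aug = [[v / sum(row) for v in row] for row in aug]
        # Later layers are applied on the left of the accumulated product.
        rollout = matmul(aug, rollout)
    return rollout


# Two layers, one head each, uniform attention over two tokens (toy input).
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(attention_rollout([[uniform], [uniform]]))
```

For a screenshot classifier, the row belonging to the class token would typically be reshaped to the patch grid and overlaid on the image as a heat map.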
