This paper presents PhishViT, a Vision Transformer-based framework for real-time phishing detection from webpage screenshots. Unlike conventional methods that analyze URL strings or HTML source code, PhishViT operates on the visual rendering layer using a fine-tuned DeiT-Small architecture. The framework is developed through three iterative phases: V1 (253 screenshots, 78.95% accuracy), V2 (642 screenshots, 96.91% accuracy), and V3 (a top-tier evaluation comprising baseline comparisons, 5-fold cross-validation with 85.23% ± 1.18% accuracy, and robustness testing). V3 achieves 91.75% accuracy, a 91.49% F1-score, an AUC-ROC of 0.9928, and an inference latency of 5.44 ms. A comprehensive baseline comparison against ResNet50, EfficientNet-B0, and ViT-Base shows that DeiT-Small offers the best efficiency-accuracy trade-off among the models tested. Robustness evaluation under six visual perturbation conditions shows a maximum accuracy drop of 2.06%. Attention rollout visualization provides interpretable evidence for each detection.
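The abstract does not say how attention rollout is computed, so the following is only a minimal sketch of the commonly used formulation: average each layer's attention over heads, add the identity matrix to account for residual connections, renormalize the rows, and multiply the layers together. The function names `attention_rollout` and `cls_heatmap` are hypothetical, the attention maps are random stand-ins rather than outputs of a trained model, and the token layout (one class token followed by a 14 × 14 patch grid, 12 layers, 6 heads) is an assumption about a DeiT-Small configuration at 224 × 224 input, not a detail taken from the paper.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), ordered from the
    first layer to the last, with each row summing to 1.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        fused = layer_attn.mean(axis=0)                    # average over heads
        fused = fused + identity                           # model the residual path
        fused = fused / fused.sum(axis=-1, keepdims=True)  # keep rows normalized
        rollout = fused @ rollout                          # accumulate across layers
    return rollout


def cls_heatmap(rollout, grid=14):
    """Relevance of each image patch to the class token, scaled so the max is 1."""
    cls_to_patches = rollout[0, 1:1 + grid * grid]  # assumes token 0 is the class token
    heat = cls_to_patches.reshape(grid, grid)
    return heat / heat.max()


# Stand-in attention maps (assumed sizes: 12 layers, 6 heads, 1 + 196 tokens).
rng = np.random.default_rng(0)
layers, heads, tokens = 12, 6, 197
logits = rng.normal(size=(layers, heads, tokens, tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax

rollout = attention_rollout(list(attn))
heatmap = cls_heatmap(rollout)
print(heatmap.shape)
```

In practice the resulting grid would be upsampled to the screenshot resolution and overlaid on the image, so that highlighted regions (for example a login form or a logo) can be inspected alongside the prediction.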
Jean Chrysostome NDAYISABYE (Fri,) studied this question.