This study explores the efficacy of four Large Language Models (LLMs), namely BERT, DistilBERT, RoBERTa, and DeBERTa, in classifying URLs as either legitimate or phishing. The research methodology is structured into three phases: dataset processing, model fine-tuning, and performance evaluation. Each LLM is fine-tuned to distinguish between phishing and legitimate URLs. The models are evaluated on both a primary dataset with extensive features and an external dataset with minimal features to rigorously assess their robustness. On the primary dataset, the models consistently achieved high performance, with AUC scores of 0.99, indicating near-perfect discrimination between phishing and legitimate URLs. DistilBERT achieved the best F1-score (99.992%), accuracy (99.991%), and precision (99.985%), showcasing its efficiency in real-world applications. BERT and DeBERTa also demonstrated excellent results, while RoBERTa, though slightly lower in precision, remained competitive. The models were also evaluated on an external test dataset containing 450,176 labeled URLs, which helped assess their performance under extremely constrained conditions. On this dataset, BERT and DeBERTa showed low F1-scores (1.14% and 2.78%, respectively) despite high precision, indicating poor recall. DistilBERT performed moderately better with a 42.44% F1-score, while RoBERTa achieved the highest F1-score of 68.03%, suggesting a superior balance between precision and recall under severe feature constraints. Overall, while all models exhibit strong performance with rich features, their ability to maintain efficacy under limited feature conditions varies. This study underscores the importance of developing models with robust performance across diverse scenarios, highlighting RoBERTa’s superior recall in feature-scarce environments and DistilBERT’s overall efficiency.
Goyal et al. (Wed,) studied this question.
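The abstract infers poor recall from a low F1-score paired with high precision. The sketch below illustrates that reasoning using the standard definition of F1 as the harmonic mean of precision and recall. The precision value of 0.99 is a hypothetical placeholder, since the abstract reports only "high precision"; the function names are likewise illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denom


# F1-scores on the external dataset, as reported in the abstract.
reported_f1 = {"BERT": 0.0114, "DeBERTa": 0.0278}

# ASSUMPTION: 0.99 is a stand-in for "high precision"; the abstract does
# not report the actual precision values on the external dataset.
assumed_precision = 0.99

for model, f1 in reported_f1.items():
    recall = implied_recall(f1, assumed_precision)
    print(f"{model}: F1={f1:.4f}, assumed P={assumed_precision}, implied R={recall:.4f}")
```

Under that assumed precision, an F1-score near 1% can only arise when recall is well below 1%, meaning the model misses almost all phishing URLs even though the few it flags are mostly correct.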