What question did this study set out to answer?

The research aims to improve phishing URL detection using a transformer-based system that incorporates contextual content features.

May 26, 2026 · Open Access

Phishing URL Detection Using Transformer-Based Architecture and Contextual Content Features

Key Points

The research aims to improve phishing URL detection using a transformer-based system that incorporates contextual content features.
Developed SemanticPhishNet, a hybrid detection system utilizing MiniLM for contextual embeddings of URLs and HTML documents.
Implemented a simple dense classifier for binary classification of phishing and benign URLs.
Evaluated model performance using a stratified three-way data split, cross-validation, and external validation.
Achieved 96–97% accuracy in cross-validation; accuracy fell to 67% in external evaluation on independent data, reflecting the difficulty of domain shift in phishing.
Performance exceeded state-of-the-art models in accuracy, recall, and generalization ability.
Confusion matrices and ROC analysis confirmed clear separation between phishing and benign classes.
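
The stratified three-way split mentioned in the key points can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_three_way_split`, the 70/15/15 default ratio, and the seed are assumptions chosen for illustration. It only shows the general idea of keeping the phishing/benign ratio roughly equal across training, validation, and test sets.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve class proportions.

    The default 70/15/15 ratio is an illustrative assumption, not a value
    reported by the study.
    """
    rng = random.Random(seed)

    # Group sample indices by class label (e.g. 1 = phishing, 0 = benign).
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut the class into three parts.
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

A split like this balances labels only. It does not by itself prevent leakage, for example when pages from the same domain land in different partitions; a leakage-aware variant would need to group samples before splitting.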

Abstract

Phishing sites cause increasing harm to consumers, commercial enterprises, and online infrastructure. Online safety depends on detecting these malicious sites accurately and in time. Many existing solutions rely on lexical features, token features, or structural hints. Although useful to a degree, these methods tend to miss the contextual meaning carried by URLs. This paper presents SemanticPhishNet, a hybrid detection system that uses a transformer-based model to process HTML documents and URL information and produce accurate, efficient phishing detections. The architecture uses MiniLM (a model of the same type as DistilBERT) to obtain contextual embeddings of cleaned HTML and augmented URL text, and a simple dense classifier to perform binary classification. The model was evaluated on a stratified three-way data split that includes real-world obfuscation patterns, such as replacing “http” with “hxxp”. The experimental findings show that SemanticPhishNet performs well across several measures, outperforming other state-of-the-art models in accuracy, recall, and generalization ability. Experiments cover both cross-validation and external validation on independent data. The framework achieves 96–97% cross-validation accuracy, while external evaluation yields 67% accuracy, a more realistic measure of generalization that reveals the difficulty of domain shift in phishing. The proposed model performs better than many existing models in real-world settings. The confusion matrices and ROC analysis indicate that the phishing and benign classes are consistently separated in both the validation and test sets. The findings show that the proposed model is efficient, stable, and scalable for present-day phishing detection.
The paper stresses the importance of appropriate evaluation techniques, such as leakage-aware splits and cross-dataset evaluation.
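
The “http” to “hxxp” obfuscation mentioned in the abstract can be made concrete with a short sketch. The helper names `defang` and `refang` are hypothetical, and the abstract does not say whether the authors normalize such URLs or keep them as-is, so this is only an assumption about how the pattern might be generated or reversed before text is passed to an embedding model.

```python
import re

# Match a scheme at the start of the string, with an optional trailing "s".
_DEFANGED_SCHEME = re.compile(r"^hxxp(s?)://", re.IGNORECASE)
_PLAIN_SCHEME = re.compile(r"^http(s?)://", re.IGNORECASE)


def refang(url: str) -> str:
    """Restore an obfuscated scheme: 'hxxp://' becomes 'http://'."""
    return _DEFANGED_SCHEME.sub(
        lambda m: "http" + m.group(1).lower() + "://", url.strip()
    )


def defang(url: str) -> str:
    """Produce the obfuscated variant: 'http://' becomes 'hxxp://'."""
    return _PLAIN_SCHEME.sub(
        lambda m: "hxxp" + m.group(1).lower() + "://", url.strip()
    )
```

Only the scheme is rewritten here; other obfuscation conventions seen in threat reports are out of scope for this sketch.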

Authors

Emad Ul Haq Qazi

Naif Arab University for Security Sciences

Muhammad Hamza Faheem

Abdulrazaq Almorjan

Naif Arab University for Security Sciences

Journals

Computers

Institutions

Naif Arab University for Security Sciences
