What question did this study set out to answer?

This research aims to develop a robust phishing detection framework capable of identifying phishing attempts written in any of 104 languages.

June 20, 2026 · Open Access

PhishGuard AI: A Hybrid Multilingual Phishing Detection System Using Machine Learning and Transformer-Based NLP

Key Points

Developed the PhishGuard AI framework, which combines supervised learning with Multilingual BERT (mBERT) for phishing detection.
Used a single unified pipeline that accepts both raw URLs and full email messages and extracts a 21-dimensional set of structural and live behavioural indicators.
Validated the system on 11,055 balanced samples using 10-fold stratified cross-validation.
Achieved a Random Forest accuracy of 97.60% and an AUC-ROC of 0.978.
Raised accuracy to 97.90% with a soft-voting ensemble of three classifiers.
Deployed the system through a Flask REST API serving a browser extension, a web dashboard, and an early-stage email gateway prototype.
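
The soft-voting step mentioned in the key points can be illustrated with a minimal sketch. It assumes each of the three classifiers returns a probability pair in the order (legitimate, phishing). The function names and the probabilities are hypothetical and are not taken from the paper, which does not list the ensemble members beyond Random Forest.

```python
from typing import List, Optional, Sequence, Tuple

def soft_vote(prob_rows: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the (weighted) mean of per-class probabilities across classifiers."""
    if not prob_rows:
        raise ValueError("need at least one classifier output")
    n_classes = len(prob_rows[0])
    if any(len(row) != n_classes for row in prob_rows):
        raise ValueError("all classifiers must report the same classes")
    if weights is None:
        weights = [1.0] * len(prob_rows)  # plain average by default
    total = sum(weights)
    return [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]

def predict(prob_rows: Sequence[Sequence[float]],
            labels: Sequence[str] = ("legitimate", "phishing")) -> Tuple[str, List[float]]:
    """Pick the class with the highest averaged probability."""
    avg = soft_vote(prob_rows)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best], avg

# Illustrative (made-up) outputs of three classifiers for one sample.
outputs = [
    [0.55, 0.45],  # hypothetical classifier A: weakly "legitimate"
    [0.52, 0.48],  # hypothetical classifier B: weakly "legitimate"
    [0.10, 0.90],  # hypothetical classifier C: strongly "phishing"
]
label, averaged = predict(outputs)
print(label, [round(p, 2) for p in averaged])
```

In this example a majority (hard) vote would label the sample legitimate, while averaging the probabilities lets the one confident classifier outweigh two uncertain ones. That is the usual motivation for soft voting; whether it explains the reported gain from 97.60% to 97.90% is not stated in the source.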

Abstract

Cyber-enabled fraud has grown substantially more sophisticated over the last decade, with phishing emerging as one of the most damaging and widely deployed attack vectors in the digital threat landscape. This study introduces PhishGuard AI, a detection framework that fuses traditional supervised learning with the cross-lingual capabilities of Multilingual BERT (mBERT) to flag phishing attempts written in any of 104 languages. A fundamental shortcoming of current commercial and academic systems is their near-exclusive reliance on English text and pre-catalogued threat signatures, rendering them blind to freshly launched campaigns and non-English content. To overcome these constraints, PhishGuard AI employs a single unified pipeline that accepts both raw URLs and full email messages as input, extracting a 21-dimensional set of structural and live behavioural indicators while simultaneously deriving 50-dimensional language embeddings through principal component reduction of the bert-base-multilingual-cased CLS vector. Experimental validation on 11,055 balanced samples under a rigorous 10-fold stratified cross-validation regime yielded a Random Forest accuracy of 97.60% and an AUC-ROC of 0.978; a soft-voting ensemble of three classifiers pushed accuracy further to 97.90%. Deployment is realised through a Flask REST API serving a browser extension, a web dashboard, and an early-stage email gateway prototype.
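
To make the stratified cross-validation regime concrete, here is a minimal sketch of how samples might be assigned to folds so that each fold keeps roughly the same class ratio as the full set. It is an illustration under stated assumptions, not the authors' code: `stratified_folds` is a hypothetical helper, and the toy label list stands in for the 11,055-sample dataset, whose exact class split is not given here.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence

def stratified_folds(labels: Sequence[Hashable], k: int = 10, seed: int = 0) -> List[int]:
    """Assign every sample a fold index in [0, k) while preserving class balance."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    fold_of = [0] * len(labels)
    offset = 0  # rotate the starting fold so leftovers do not pile up in fold 0
    for y in sorted(by_class, key=repr):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = (pos + offset) % k
        offset += len(indices)
    return fold_of

# Toy data: 50 legitimate (0) and 50 phishing (1) samples.
toy_labels = [0] * 50 + [1] * 50
folds = stratified_folds(toy_labels, k=10)

# Each fold serves once as the held-out test set; the other nine form the training set.
for fold in range(10):
    test_idx = [i for i, f in enumerate(folds) if f == fold]
    counts = Counter(toy_labels[i] for i in test_idx)
    assert counts[0] == 5 and counts[1] == 5
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used; the hand-rolled version above only shows what "stratified" means.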

Cite This Study

Shejwal et al. studied this question.

synapsesocial.com/papers/6a3631a1db0793dc1a53879e https://doi.org/10.56975/ijsdr.v11i6.310578
