What question did this study set out to answer?

The aim is to develop an innovative framework for enhanced phishing email detection by integrating multi-modal features.

January 16, 2026 | Open Access

LLM-Based Multimodal Feature Extraction and Hierarchical Fusion for Phishing Email Detection

Key Points

Developed a phishing detection framework called SAHF-PD integrating multiple feature modalities.
Utilized domain-specific prompts to extract features from email body, URLs, screenshots, and source code.
Implemented Semantic-Aware Hierarchical Fusion to organize features by semantic relevance.
Introduced PhishMMF, a dataset containing 11,672 verified phishing samples for training and evaluation.
Achieved an AUC of 0.99927 and an F1-score of 0.98728 with XGBoost using SAHF.
Reduced dimensionality from 228 to 56 features (75.4% reduction) while maintaining accuracy.
Decreased average training time by 43.7% across eight classifiers.
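
The extraction step pairs each modality with a domain-specific prompt and, according to the abstract, a standardized output schema. The sketch below shows one way such a constraint might be enforced on a model's reply before the features reach a classifier. The schema, its field names, and the `parse_features` helper are hypothetical illustrations, not the paper's actual feature set or code.

```python
import json

# Hypothetical schema for the email-body modality: field name -> expected type.
# These names are illustrative assumptions, not features taken from the paper.
BODY_SCHEMA = {
    "urgency_score": float,        # assumed range 0.0 to 1.0
    "impersonates_brand": bool,
    "requests_credentials": bool,
    "num_links": int,
}


def parse_features(raw_output: str, schema: dict) -> dict:
    """Parse an LLM's JSON reply and keep only the fields named in the schema.

    Raises ValueError if the reply is not a JSON object, or if a field is
    missing or has the wrong type, so malformed generations are rejected
    instead of being passed downstream.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    features = {}
    for name, expected in schema.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if expected in (int, float) and isinstance(value, bool):
            raise ValueError(f"wrong type for {name}")
        # JSON has a single number type, so accept an int where a float is expected.
        if expected is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for {name}")
        features[name] = value
    return features


if __name__ == "__main__":
    reply = ('{"urgency_score": 0.9, "impersonates_brand": true, '
             '"requests_credentials": true, "num_links": 3}')
    print(parse_features(reply, BODY_SCHEMA))
```

Rejecting a malformed reply outright, rather than repairing it, keeps every feature vector the same shape. That is one plausible reading of how a fixed schema counters the stochastic nature of raw generative output.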

Abstract

Phishing emails continue to evade conventional detection systems due to their increasingly sophisticated, multi-faceted social engineering tactics. To address the limitations of single-modality or rule-based approaches, we propose SAHF-PD, a novel phishing detection framework that integrates multi-modal feature extraction with semantic-aware hierarchical fusion, based on large language models (LLMs). Our method leverages modality-specialized large models, each guided by domain-specific prompts and constrained to a standardized output schema, to extract structured feature representations from four complementary sources associated with each phishing email: email body text; open-source intelligence (OSINT) derived from the key embedded URL; screenshot of the landing page; and the corresponding HTML/JavaScript source code. This design mitigates the unstructured and stochastic nature of raw generative outputs, yielding consistent, interpretable, and machine-readable features. These features are then integrated through our Semantic-Aware Hierarchical Fusion (SAHF) mechanism, which organizes them into core, auxiliary, and weakly associated layers according to their semantic relevance to phishing intent. This layered architecture enables dynamic weighting and redundancy reduction based on semantic relevance, which in turn highlights the most discriminative signals across modalities and enhances model interpretability. We also introduce PhishMMF, a publicly released multimodal feature dataset for phishing detection, comprising 11,672 human-verified samples with meticulously extracted structured features from all four modalities. Experiments with eight diverse classifiers demonstrate that the SAHF-PD framework enables exceptional performance. For instance, XGBoost equipped with SAHF attains an AUC of 0.99927 and an F1-score of 0.98728, outperforming the same model using the original feature representation. 
Moreover, SAHF compresses the original 228-dimensional feature space into a compact 56-dimensional representation (a 75.4% reduction), reducing the average training time across all eight classifiers by 43.7% while maintaining comparable detection accuracy. Ablation studies confirm the unique contribution of each modality. Our work establishes a transparent, efficient, and high-performance foundation for next-generation anti-phishing systems.
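
The abstract describes SAHF as grouping features into core, auxiliary, and weakly associated layers, then applying weighting and redundancy reduction. The following is a simplified stand-in for that idea, not the authors' algorithm. It assumes tier labels are already assigned, uses hypothetical fixed tier weights, and substitutes a plain correlation filter for whatever redundancy criterion the paper uses.

```python
import numpy as np

# Hypothetical tier weights: illustrative values, not taken from the paper.
TIER_WEIGHTS = {"core": 1.0, "auxiliary": 0.5, "weak": 0.1}


def fuse(X, tiers, corr_threshold=0.95):
    """Return (fused matrix, kept column indices).

    X     : array-like of shape (n_samples, n_features), n_features >= 2
    tiers : list of tier names, one per column of X
    Within each tier, a column is dropped when its absolute Pearson
    correlation with an already-kept column of the same tier exceeds
    corr_threshold. Surviving columns are scaled by their tier weight.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
    kept = []
    for tier in TIER_WEIGHTS:  # process tiers in order: core, auxiliary, weak
        tier_kept = []
        for j, t in enumerate(tiers):
            if t != tier:
                continue
            # Skip columns that are near-duplicates of one already kept.
            if any(abs(corr[j, k]) > corr_threshold for k in tier_kept):
                continue
            tier_kept.append(j)
        kept.extend(tier_kept)
    weights = np.array([TIER_WEIGHTS[tiers[j]] for j in kept])
    return X[:, kept] * weights, kept


if __name__ == "__main__":
    X = np.array([
        [1.0, 2.0, 1.0, 4.0],
        [2.0, 4.0, 0.0, 1.0],
        [3.0, 6.0, 1.0, 3.0],
        [4.0, 8.0, 0.0, 2.0],
    ])
    tiers = ["core", "core", "auxiliary", "weak"]
    fused, kept = fuse(X, tiers)
    print(kept)         # indices of the columns that survived
    print(fused.shape)  # fewer columns than X when duplicates are removed
```

In this toy input the second column is an exact multiple of the first, so the filter drops it and the matrix shrinks from four columns to three. The reduction reported in the abstract is consistent with its stated dimensions: (228 - 56) / 228 ≈ 75.4%.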
