Phishing emails continue to evade conventional detection systems due to their increasingly sophisticated, multi-faceted social engineering tactics. To address the limitations of single-modality and rule-based approaches, we propose SAHF-PD, a novel phishing detection framework built on large language models (LLMs) that integrates multimodal feature extraction with semantic-aware hierarchical fusion. Our method leverages modality-specialized large models, each guided by domain-specific prompts and constrained to a standardized output schema, to extract structured feature representations from four complementary sources associated with each phishing email: the email body text; open-source intelligence (OSINT) derived from the key embedded URL; a screenshot of the landing page; and the corresponding HTML/JavaScript source code. This design mitigates the unstructured and stochastic nature of raw generative outputs, yielding consistent, interpretable, and machine-readable features. These features are then integrated through our Semantic-Aware Hierarchical Fusion (SAHF) mechanism, which organizes them into core, auxiliary, and weakly associated layers according to their semantic relevance to phishing intent. This layered architecture enables dynamic weighting and redundancy reduction, which in turn highlights the most discriminative signals across modalities and enhances model interpretability. We also introduce PhishMMF, a publicly released multimodal feature dataset for phishing detection, comprising 11,672 human-verified samples with structured features extracted from all four modalities. Experiments with eight diverse classifiers demonstrate that the SAHF-PD framework achieves strong performance. For instance, XGBoost equipped with SAHF attains an AUC of 0.99927 and an F1-score of 0.98728, outperforming the same model using the original feature representation.
Moreover, SAHF compresses the original 228-dimensional feature space into a compact 56-dimensional representation (a 75.4% reduction), reducing the average training time across all eight classifiers by 43.7% while maintaining comparable detection accuracy. Ablation studies confirm the unique contribution of each modality. Our work establishes a transparent, efficient, and high-performance foundation for next-generation anti-phishing systems.
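To make the layered fusion idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each fused feature is a tier-weighted mean of a group of raw features. The tier weights, the feature names, the grouping, and the mean aggregation are all hypothetical, since the abstract does not specify them. The helper `reduction_pct` only reproduces the dimensionality arithmetic quoted above.

```python
from dataclasses import dataclass

# Hypothetical tier weights; the abstract does not state the actual values.
TIER_WEIGHTS = {"core": 1.0, "auxiliary": 0.5, "weak": 0.1}


@dataclass(frozen=True)
class FeatureGroup:
    name: str       # name of the fused output feature (illustrative)
    tier: str       # "core", "auxiliary", or "weak"
    members: tuple  # raw feature names merged into this group


def fuse(raw: dict, groups: list) -> dict:
    """Collapse raw features into one tier-weighted value per group.

    Assumption: redundancy reduction is modeled as a plain mean over the
    group's members; the paper may use a different aggregation.
    """
    fused = {}
    for group in groups:
        values = [float(raw.get(member, 0.0)) for member in group.members]
        mean = sum(values) / len(values) if values else 0.0
        fused[group.name] = TIER_WEIGHTS[group.tier] * mean
    return fused


def reduction_pct(original_dim: int, fused_dim: int) -> float:
    """Percentage of dimensions removed, rounded to one decimal place."""
    return round(100.0 * (original_dim - fused_dim) / original_dim, 1)


# Illustrative raw features from the four modalities (names are made up).
raw = {
    "body_urgency": 1.0,
    "body_threat": 0.0,
    "osint_domain_age_young": 1.0,
    "html_hidden_form": 1.0,
    "shot_logo_present": 1.0,
}

groups = [
    FeatureGroup("pressure_language", "core", ("body_urgency", "body_threat")),
    FeatureGroup("infra_risk", "auxiliary", ("osint_domain_age_young", "html_hidden_form")),
    FeatureGroup("branding", "weak", ("shot_logo_present",)),
]

fused = fuse(raw, groups)
print(fused)
print(reduction_pct(228, 56))  # dimensions reported in the abstract
```

In this toy setup five raw features collapse into three fused ones, and the same arithmetic applied to the reported dimensions (228 to 56) gives the 75.4% reduction quoted above.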
Yuan et al. studied this question.