Machine learning-based phishing detection models suffer from significant performance degradation when deployed across different datasets, a critical challenge that limits their real-world applicability. This research addresses this cross-dataset generalization problem by developing and validating a novel ensemble transfer learning framework designed to ensure robust performance in diverse operational environments. The proposed Cross-Domain Ensemble Probability Fusion (CDEPF) framework was evaluated using two heterogeneous datasets with zero feature overlap: the historical UCI Phishing Dataset and the modern PhiUSIIL Phishing URL Dataset. The methodology involves harmonizing these disparate feature sets into a unified 20-dimensional space using Principal Component Analysis (PCA) and integrating predictions through an information-theoretic weighted fusion strategy. Experimental results demonstrate that the CDEPF framework achieves a cross-dataset accuracy of 94.4%, a substantial increase from the 57.4% baseline performance. This represents a 64.3% relative improvement, validated with high statistical significance (p < 0.0001) and a large practical effect size. The framework provides a robust and deployment-ready solution that effectively bridges the performance gap in cross-domain phishing detection. This study contributes a validated methodological approach for domain adaptation in cybersecurity, enhancing the reliability of machine learning models against evolving cyber threats. Future work should explore multi-domain transfer architectures and real-world deployment validation.

• Novel ensemble transfer learning framework achieving 94.4% cross-dataset accuracy.
• Comprehensive feature harmonization for zero-overlap datasets using PCA.
• Statistical validation with 64.3% relative improvement over baseline.
• Practical deployment-ready solution for cybersecurity applications.
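The abstract does not spell out how the information-theoretic weighted fusion is computed, so the following is only a minimal sketch under stated assumptions, not the CDEPF implementation. It assumes each model emits a phishing probability and that a model's weight is one minus the binary Shannon entropy of its prediction, so confident models count for more. The function names `binary_entropy`, `fuse_probabilities`, and `relative_improvement` are hypothetical. The last helper shows how a relative improvement is derived from two accuracies; with the rounded figures quoted above it gives roughly 64.5%, so the reported 64.3% was presumably computed from unrounded values.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def fuse_probabilities(probs: list[float]) -> float:
    """Fuse per-model phishing probabilities into one score.

    Assumption: weight = 1 - entropy, i.e. more confident models get
    larger weights. This is an illustrative choice, not the paper's.
    """
    weights = [1.0 - binary_entropy(p) for p in probs]
    total = sum(weights)
    if total == 0.0:
        # Every model is maximally uncertain: fall back to a plain mean.
        return sum(probs) / len(probs)
    return sum(w * p for w, p in zip(weights, probs)) / total


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Two hypothetical model outputs for the same URL.
    print(fuse_probabilities([0.95, 0.60]))
    print(relative_improvement(57.4, 94.4))
```

In this sketch a model that outputs exactly 0.5 receives zero weight, so the fused score is driven entirely by the models that commit to a prediction.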
Henry et al. (Sun,) studied this question.