Because the internet has become indispensable, malicious actors exploit URLs as a vector for threats to network information security and integrity. Traditional URL detection methods have become inefficient against the uncontrolled growth in the number of URLs, especially when facing dynamic, large-scale threats. To address these limitations and provide intelligent, scalable detection of malicious URLs, this study proposes a hybrid framework, NetGuard, that integrates probabilistic data structures (PDSs) with machine learning (ML). NetGuard uses PDSs to build a Hybrid Scalable Detection Filter (HSDF), which combines the strengths of counting Bloom filters (CBFs), which support deletion, and scalable Bloom filters (SBFs), which grow with the data. The HSDF provides efficient membership queries under a bounded false-positive rate (approximately 0.01), efficient data management, and low-latency lookups on the order of 10⁻⁵ s. In addition, NetGuard trains and packages a learned ML classifier for detecting malicious URLs, using Decision Tree (DT) and Random Forest (RF) classifiers. The classifiers are trained on a novel dataset, SupURLsIdDs, which includes fifteen distinctive lexical and structural URL features extracted from four URL classes: benign, defacement, malware, and phishing. The experimental results indicate that the HSDF handles insertion and deletion effectively with minimal memory consumption (approximately 2.7 MB for 222,000 URLs) while maintaining a controlled false-positive rate (approximately 0.01 on the real-only subset, rising to 0.12 with synthetic data). The HSDF memory footprint is 99.88% smaller than that of the RF model (which requires 2253.17 MB); thus, the HSDF complements RF as an ultra-lightweight first line of defense.
Among the ML classifiers, RF performed best, achieving an overall classification accuracy of approximately 96% on large-scale URL data. The experiments were conducted on benchmark datasets constructed from aggregated real and synthetic data to demonstrate the scalability, adaptability, and resource efficiency of the first phase of NetGuard as a practical foundation for real-time web threat detection. Real-time integration and dynamic updates are presented as a deployment architecture and constitute future work.
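To make the idea of combining counting and scalable Bloom filters concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' HSDF implementation. The class names (`CountingBloomFilter`, `HybridFilter`), the 8-bit saturating counters, the SHA-256 double-hashing scheme, and the growth and tightening parameters are all hypothetical choices made for illustration. Each slice is a counting filter, so entries can be deleted, and a new, larger slice with a tighter error budget is added whenever the current slice reaches capacity, so the compound false-positive rate stays below the target.

```python
import hashlib
import math


class CountingBloomFilter:
    """Fixed-size counting Bloom filter; counters (not bits) permit deletion."""

    def __init__(self, capacity: int, error_rate: float):
        self.capacity = capacity
        self.error_rate = error_rate
        # Standard sizing: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.m = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.counters = bytearray(self.m)  # assumption: 8-bit counters, saturating at 255
        self.count = 0

    def _indexes(self, item: str):
        # Double hashing: derive k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indexes(item))

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            if self.counters[i] < 255:
                self.counters[i] += 1
        self.count += 1

    def remove(self, item: str) -> None:
        for i in self._indexes(item):
            if 0 < self.counters[i] < 255:  # saturated counters are left untouched
                self.counters[i] -= 1
        self.count = max(0, self.count - 1)


class HybridFilter:
    """Hypothetical scalable filter built from counting slices."""

    def __init__(self, initial_capacity=1000, error_rate=0.01, growth=2, tightening=0.5):
        self.growth = growth
        self.tightening = tightening
        # Slice i gets error_rate * (1 - r) * r**i, a geometric series summing to error_rate.
        first = CountingBloomFilter(initial_capacity, error_rate * (1 - tightening))
        self.slices = [first]

    def __contains__(self, url: str) -> bool:
        return any(url in s for s in self.slices)

    def add(self, url: str) -> None:
        last = self.slices[-1]
        if last.count >= last.capacity:
            # Current slice is full: append a larger slice with a tighter error budget.
            last = CountingBloomFilter(last.capacity * self.growth,
                                       last.error_rate * self.tightening)
            self.slices.append(last)
        last.add(url)

    def remove(self, url: str) -> bool:
        # Only delete from a slice that reports the URL, to avoid corrupting counters.
        for s in self.slices:
            if url in s:
                s.remove(url)
                return True
        return False


if __name__ == "__main__":
    blocklist = HybridFilter()
    blocklist.add("http://malware.example/payload.exe")  # hypothetical URL
    print("http://malware.example/payload.exe" in blocklist)
    blocklist.remove("http://malware.example/payload.exe")
    print("http://malware.example/payload.exe" in blocklist)
```

Two caveats apply to any design of this kind. A URL added twice must be removed twice, because the sketch does not deduplicate insertions. Removing a URL that was never inserted but happens to be a false positive would decrement counters belonging to other entries, which is why deletion is usually restricted to URLs known to have been inserted.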
Khudhur et al. studied this question.