What question did this study set out to answer?

The aim is to develop an intelligent and scalable framework for detecting malicious URLs amidst increasing online threats.

June 12, 2026 | Open Access

NetGuard: A Hybrid Framework for Intelligent and Scalable Malicious URL Detection

Key Points

The aim is to develop an intelligent and scalable framework for detecting malicious URLs amidst increasing online threats.
Proposed NetGuard, a hybrid framework that integrates probabilistic data structures with machine learning.
Utilized counting Bloom filters and scalable Bloom filters for efficient URL detection and management.
Trained Decision Tree and Random Forest classifiers on the SupURLsIdDs dataset, which comprises diverse URL features.
The Hybrid Scalable Detection Filter (HSDF) maintained a controlled false-positive rate of approximately 0.01 and a query latency on the order of 10⁻⁵ s.
Memory consumption for 222,000 URLs was approximately 2.7 MB, a 99.88% reduction relative to the Random Forest model's 2253.17 MB.
Random Forest achieved an overall classification accuracy of approximately 96% on large-scale URL data.
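
The memory figures above can be checked with a few lines of arithmetic. The sketch below recomputes the reported 99.88% reduction from the two stated sizes and, for comparison only, evaluates the textbook sizing formula for a plain (non-counting) Bloom filter, m = -n ln p / (ln 2)^2. The function names are illustrative, and the sizing formula is a standard result, not the sizing procedure used in the study; a counting, multi-slice filter such as the HSDF needs more space than this lower bound.

```python
import math


def reduction_percent(small_mb, large_mb):
    """Percentage by which small_mb is smaller than large_mb."""
    return (1 - small_mb / large_mb) * 100


def optimal_bloom_bits(n_items, fp_rate):
    """Textbook bit count for a plain Bloom filter: m = -n ln p / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)


# Figures reported in the key points: 2.7 MB (HSDF) vs 2253.17 MB (Random Forest).
saving = reduction_percent(2.7, 2253.17)
print(f"HSDF footprint is {saving:.2f}% smaller")  # -> 99.88% smaller

# Assumption: a single plain bit-array filter for 222,000 URLs at p = 0.01.
bits = optimal_bloom_bits(222_000, 0.01)
print(f"Plain Bloom filter lower bound: {bits / 8 / 1024:.0f} KiB")
```

The plain-filter bound is well below the reported 2.7 MB, which is consistent with the extra cost of per-slot counters and multiple filter slices, although the summary does not break the footprint down.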

Abstract

Because the internet has become indispensable, malicious actors exploit URLs as a source of threats to network information security and integrity. Traditional URL detection methods have become inefficient as the number of URLs grows unchecked, especially against dynamic, large-scale threats. To address these limitations and provide intelligent, scalable detection of malicious URLs, this study proposes a hybrid framework, NetGuard, that integrates probabilistic data structures (PDSs) with machine learning (ML). NetGuard uses PDSs to build a Hybrid Scalable Detection Filter (HSDF), which combines the deletion capability of counting Bloom filters (CBFs) with the capacity growth of scalable Bloom filters (SBFs). The HSDF answers membership queries with a bounded false-positive rate (approximately 0.01) and supports efficient data management with low-latency lookups on the order of 10⁻⁵ s. On the ML side, NetGuard trains and packages a learned classifier for detecting malicious URLs, using Decision Tree (DT) and Random Forest (RF) classifiers. The classifiers are trained on a novel dataset, SupURLsIdDs, which includes fifteen distinctive lexical and structural URL features extracted from four URL classes: benign, defacement, malware, and phishing. The experimental results show that the HSDF handles insertion and deletion effectively with minimal memory consumption (approximately 2.7 MB for 222,000 URLs) while maintaining a controlled false-positive rate (approximately 0.01 on the real-only subset, rising to 0.12 with synthetic data). The HSDF memory footprint is 99.88% smaller than that of the RF model (2253.17 MB); the HSDF therefore complements RF as an ultra-lightweight first line of defense. Among the ML classifiers, RF performed best, achieving an overall classification accuracy of approximately 96% on large-scale URL data. The experiments were conducted on benchmark datasets constructed from aggregated real and synthetic data to demonstrate the scalability, adaptability, and resource efficiency of the first phase of NetGuard as a practical foundation for real-time web threat detection. Real-time integration and dynamic updates are presented as a deployment architecture and constitute future work.
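
To make the filter design in the abstract more concrete, here is a minimal sketch of how a counting Bloom filter and a scalable Bloom filter can be combined. It is not the authors' HSDF implementation: the class names, the SHA-256 double-hashing scheme, the 8-bit counters, and the growth and tightening parameters are all assumptions chosen for illustration.

```python
import hashlib
import math


class CountingBloomFilter:
    """Fixed-capacity counting Bloom filter (one 8-bit counter per slot)."""

    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.error_rate = error_rate
        # Textbook sizing: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        self.size = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.counters = bytearray(self.size)
        self.count = 0

    def _positions(self, item):
        # Double hashing: derive k slot indexes from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            if self.counters[p] < 255:  # saturate instead of overflowing
                self.counters[p] += 1
        self.count += 1

    def remove(self, item):
        for p in self._positions(item):
            if 0 < self.counters[p] < 255:  # saturated counters stay put
                self.counters[p] -= 1
        self.count -= 1


class HybridScalableFilter:
    """Illustrative hybrid: a growing list of counting Bloom filter slices."""

    def __init__(self, initial_capacity=1000, error_rate=0.01,
                 growth=2, tightening=0.5):
        self.growth = growth
        self.tightening = tightening
        # Slice error rates form a geometric series whose sum stays
        # at or below the requested overall error_rate.
        first_rate = error_rate * (1 - tightening)
        self.slices = [CountingBloomFilter(initial_capacity, first_rate)]

    def __contains__(self, url):
        return any(url in s for s in self.slices)

    def add(self, url):
        if url in self:  # already present (or a false positive)
            return False
        last = self.slices[-1]
        if last.count >= last.capacity:  # newest slice is full: grow
            last = CountingBloomFilter(last.capacity * self.growth,
                                       last.error_rate * self.tightening)
            self.slices.append(last)
        last.add(url)
        return True

    def remove(self, url):
        # Only call this for URLs that were really added: removing a
        # false positive would corrupt counters shared with other URLs.
        for s in self.slices:
            if url in s:
                s.remove(url)
                return True
        return False
```

In use, `HybridScalableFilter()` would sit in front of a classifier: a URL found in the filter is flagged immediately, and only unseen URLs go to the heavier model. The sketch gives no false negatives for URLs that were added and not removed, but deletion in any counting Bloom filter is only safe for items known to be present.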

Cite This Study

Khudhur et al. studied this question.

synapsesocial.com/papers/6a2ba4a18101cf8926f02fbe https://doi.org/10.3390/jcp6030102
