Identification of Criminal Networks Via External Identifier-Based Clustering on the Dark Web
Abstract
The dark web’s anonymity infrastructure facilitates serious crimes, including drug trafficking, arms dealing, and the distribution of child sexual abuse material. These illicit ecosystems are organizationally structured: the same actors disperse their activities across multiple domains to evade investigation. Consequently, approaches that analyze individual dark web sites independently have limited ability to identify criminal organizations, derive investigative leads, or reveal cross-site relationships. To address this limitation, we propose a dark web external identifier clustering methodology that detects common organizations by leveraging shared external identifiers (email, Telegram, and Twitter) observed across distributed sites. From approximately 1.49 million crawled dark web URLs, our recursive crawler observed approximately 100,000 subdomains returning HTTP 200. Among these, 24,403 subdomains contained at least one external identifier, from which we selected 10,274 criminal subdomains for analysis. We clustered these subdomains under seven scenarios spanning three strategies: Single Identifier Clustering, which groups domains that share a single chosen identifier; Sequential Clustering, which traces inter-organizational ties by varying the combinations and initial order of the three identifiers; and Global Clustering, which organizes the full connectivity induced by all collected identifiers. For each discovered organization, we verified organizational sameness through HTML similarity, domain prefix matching, and manual verification via direct access. Our results show that Global Clustering identifies the largest number of organizations, whereas email-anchored Sequential Clustering yields the most effective clues for understanding and tracking organizational operating structures.
Key Points
- Global Clustering identifies the largest number of criminal organizations because it uses the full connectivity induced by all collected identifiers.
- Approximately 1.49 million dark web URLs were crawled; of the 24,403 subdomains containing at least one external identifier, 10,274 criminal subdomains were selected for analysis.
- The approach covers seven scenarios across three strategies (Single Identifier, Sequential, and Global Clustering), with each discovered organization verified through HTML similarity, domain prefix matching, and manual access.
- Email-anchored Sequential Clustering yields the most effective leads for understanding and tracking organizational operating structures.
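To make the idea of "connectivity induced by shared identifiers" concrete, the following is a minimal sketch of one way to group subdomains that are linked, directly or transitively, through common identifiers. It is not the authors' implementation: the union-find approach, the function name `cluster_by_identifiers`, the `kinds` filter, the case-insensitive matching, and the sample `.onion` names and handles are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_identifiers(site_identifiers, kinds=None):
    """Group subdomains connected through shared external identifiers.

    site_identifiers: dict mapping subdomain -> set of (kind, value) pairs,
        where kind is e.g. "email", "telegram", or "twitter".
    kinds: optional set of identifier kinds to use; None means all kinds
        (a rough analogue of global clustering), a single kind gives a
        rough analogue of single-identifier clustering. Hypothetical knob.
    """
    parent = {}

    def find(node):
        # Locate the representative of node's set, halving paths on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for site, identifiers in site_identifiers.items():
        site_node = ("site", site)
        find(site_node)  # register sites even if they have no usable identifier
        for kind, value in identifiers:
            if kinds is not None and kind not in kinds:
                continue
            # Assumption: identifiers are compared case-insensitively.
            union(site_node, ("id", kind, value.lower()))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "site":
            groups[find(node)].add(node[1])

    # Largest clusters first, then alphabetical, for stable output.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical sample data, not taken from the study.
sample = {
    "a.onion": {("email", "ops@example.com")},
    "b.onion": {("email", "OPS@example.com"), ("telegram", "@vendor1")},
    "c.onion": {("telegram", "@vendor1")},
    "d.onion": {("twitter", "@other")},
}

all_kinds = cluster_by_identifiers(sample)
email_only = cluster_by_identifiers(sample, kinds={"email"})
```

In this toy data, `a.onion` and `c.onion` share no identifier directly but end up in the same group because both connect to `b.onion`. Restricting to email alone loses that link, which illustrates why using all identifiers can surface more cross-site relationships than any single identifier.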