What question did this study set out to answer?

The research aims to extend MISP taxonomies to better classify drug-related discourse on the Dark Web.

April 25, 2026 | Open Access

Extending MISP Taxonomies for Drug-Related Forum Classification on the Dark Web: A Human-in-the-Loop and LLM-Based Approach

Key Points

The research aims to extend MISP taxonomies to better classify drug-related discourse on the Dark Web.
Integrated large language models (LLMs) with Human-in-the-Loop (HITL) validation to reduce ambiguity in classification.
Built a final corpus of 6456 drug-related posts and classified a stratified analytical subset of 2904 posts by morphology.
Used network visualization with VOSviewer to identify the major discursive axes in digital drug markets.
The first classification pass assigned 76.48% of posts directly to one of the base morphological categories.
HITL refinement reduced the proportion of unclear posts from 23.52% to 11.29%, a 51.99% relative reduction.
Identified three major axes in drug discourse: recreational–commercial, pharmaceutical–opioid, and transnational–logistical.
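
The relative reduction quoted in the key points follows from the two reported percentages. A minimal sketch of that arithmetic, using only the rounded figures given here (the function name is illustrative, not taken from the study):

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Percentages of posts labelled "unclear", as reported in the study.
unclear_first_pass = 23.52
unclear_after_hitl = 11.29

reduction = relative_reduction(unclear_first_pass, unclear_after_hitl)
print(f"{reduction:.1f}%")  # about 52.0% from the rounded inputs
```

Computed from the rounded percentages, the result is roughly 52.0%, in line with the reported 51.99%. The small gap presumably reflects rounding or the use of raw counts, which are not given here.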

Abstract

This study proposes a methodological framework for extending Malware Information Sharing Platform (MISP) taxonomies in the domain of Dark Web drug forums through the integration of large language models (LLMs) and Human-in-the-Loop (HITL) validation. The research addresses the existing ontological gap between traditional MISP taxonomies, focused on technical or chemical indicators, and the linguistic and morphological complexity of illicit digital markets. By modelling the primary physical form as an ontological predicate with mutually exclusive values (for example, powder, pill–tablet–capsule, liquid, and plant-matter), the proposed approach captures the material dimension of the discourse, enhancing semantic disambiguation and forensic traceability. The Mistral 7B model was used in the morphology-classification stage conducted on a stratified analytical subset of 2904 drug-related Dark Web posts, extracted from a final corpus of 6456 posts after data cleaning and relevance filtering. In the first pass, 76.48% of posts were directly assigned to one of the base morphological categories, while 23.52% were labelled as unclear and subsequently reviewed through the HITL stage. Following HITL refinement and full reclassification, the proportion of posts labelled as unclear decreased from 23.52% to 11.29%, corresponding to a 51.99% relative reduction in ambiguity. Network visualisation with VOSviewer revealed three major discursive axes—recreational–commercial, pharmaceutical–opioid, and transnational–logistical—reflecting the hybrid semantic structure of digital drug markets. The results show that combining LLM-based inference with expert oversight improves the interpretability, reproducibility and ontological robustness of cyberintelligence models, offering a replicable framework for other sensitive domains such as terrorism or child exploitation.
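
To make the idea of a predicate with mutually exclusive values concrete, the following is a minimal sketch of how such a taxonomy entry could be expressed and enforced. It is not the authors' taxonomy: the namespace `darkweb-drugs` and the predicate name `primary-physical-form` are hypothetical, the four values are the examples named in the abstract, and the field layout reflects a general understanding of the MISP taxonomy JSON format that should be checked against the official schema.

```python
import json

# Hypothetical names, chosen for illustration only.
NAMESPACE = "darkweb-drugs"
PREDICATE = "primary-physical-form"
# Example values named in the abstract; the full list may differ.
FORMS = ["powder", "pill-tablet-capsule", "liquid", "plant-matter"]


def build_taxonomy() -> dict:
    """Build a taxonomy dictionary with one exclusive predicate."""
    return {
        "namespace": NAMESPACE,
        "description": "Illustrative drug-morphology taxonomy (sketch).",
        "version": 1,
        "predicates": [
            {
                "value": PREDICATE,
                "expanded": "Primary physical form",
                "exclusive": True,  # at most one value per post
            }
        ],
        "values": [
            {
                "predicate": PREDICATE,
                "entry": [
                    {"value": form, "expanded": form.replace("-", " ")}
                    for form in FORMS
                ],
            }
        ],
    }


def machine_tag(form: str) -> str:
    """Format a triple tag: namespace:predicate="value"."""
    if form not in FORMS:
        raise ValueError(f"unknown form: {form!r}")
    return f'{NAMESPACE}:{PREDICATE}="{form}"'


def tag_post(forms: list[str]) -> str:
    """Enforce mutual exclusivity: a post gets exactly one form."""
    if len(forms) != 1:
        raise ValueError("exactly one primary physical form is allowed")
    return machine_tag(forms[0])


taxonomy = build_taxonomy()
print(json.dumps(taxonomy["predicates"], indent=2))
print(tag_post(["powder"]))
```

Rejecting a post that carries two forms, rather than silently keeping the first, mirrors the role of the "unclear" label in the study: ambiguous cases are surfaced for human review instead of being forced into a category.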
