What does this research mean for the field?

A virtualized host orchestration framework can generate diverse, accurately labeled network traffic datasets that effectively reveal the failure modes of unsupervised machine learning-based intrusion detection algorithms. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to develop a controlled virtual environment for generating high-quality, accurately labeled datasets for network intrusion detection.

June 10, 2026 · Open Access

A Controlled Virtual Environment for High-Quality, Realistic, and Accurately Labeled Data Generation in Network Security

Key points

This research aims to develop a controlled virtual environment for generating high-quality, accurately labeled datasets for network intrusion detection.
Created a host orchestration framework within a Linux-based virtual machine to generate network traces.
Developed custom networking scenarios with scripted host behavior profiles for benign and malicious traffic interactions.
Performed quality analysis on the dataset using feature representations from raw packet capture and evaluated detection results with performance scores.
The dataset contains 81% malicious flows, providing a significant challenge for unsupervised streaming-based algorithms due to the overwhelming attack density.
Most algorithms misclassify benign traffic as anomalous, while the Sparse Data Observers (SDO) algorithm successfully detects malicious activity due to its semi-supervised learning approach.
The framework effectively generates diverse, clearly attributable network traffic, facilitating deeper investigations into ML-based detection failures.
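
As an illustration of what "clearly attributable" traffic can mean in practice, the sketch below labels flows from scenario metadata. It is a minimal example under stated assumptions, not the thesis's implementation: the `SCENARIO_ROLES` mapping, the `Flow` type, the `label_flow` rule, and all IP addresses are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical scenario metadata: the role each scripted host plays.
# Addresses come from private and documentation ranges, not from the thesis.
SCENARIO_ROLES = {
    "10.0.0.11": "benign",
    "10.0.0.12": "benign",
    "203.0.113.5": "attacker",
}


@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    dst_port: int


def label_flow(flow, roles):
    """Assumed rule: a flow is malicious if either endpoint is a scripted attacker."""
    if any(roles.get(ip) == "attacker" for ip in (flow.src, flow.dst)):
        return "malicious"
    return "benign"


flows = [
    Flow("10.0.0.11", "198.51.100.20", 443),  # benign host to an Internet service
    Flow("203.0.113.5", "10.0.0.12", 22),     # external attacker to an internal host
]
labels = [label_flow(f, SCENARIO_ROLES) for f in flows]
print(labels)
```

Because every host's role is fixed by the scenario before traffic is generated, labels of this kind need no manual inspection. The endpoint-based rule is only one possible choice; it also marks a victim's replies to an attacker as malicious.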

Abstract

Modern network intrusion detection research relies on the availability of high-quality datasets for the development and validation of detection algorithms. However, data quality issues currently permeate such network traffic datasets, most notably a lack of representative benign traffic, poor data labeling, and the inability to reproduce and amend datasets. While recording real-world network traffic is the gold standard for high-quality data, poor reproducibility and costly labeling efforts limit its practicality as a data generation technique. Virtualized environments offer a cost-effective alternative in which network hosts generate traffic according to scripted behavior profiles, with data collection and labeling being fully automated.

In this thesis we create a host orchestration framework within a Linux-based virtual machine, which we use to generate network traces and labeling metadata for small-scale, clearly defined network environments consisting of multiple benign hosts on an internal network and multiple external attackers. We develop custom networking scenarios which detail each host’s role on the network, and implement scripted host behavior profiles that specify malicious host-to-host and benign host-to-Internet-service interactions. We execute these networking scenarios using our host orchestration framework, and select the most diverse dataset in the generated collection for quality analysis. We evaluate the challenge posed by this dataset to unsupervised Machine Learning (ML)-based anomaly detection algorithms. We extract network flows from the dataset’s raw packet capture using several feature representations, and perform streaming-based and static analysis on these flows. Detection results are evaluated using algorithm performance scores and visual time-series-based analysis.

Our dataset proves extremely challenging for fully unsupervised streaming-based algorithms due to the high percentage of malicious flows in the dataset (81%) and their high spatial and temporal density relative to sparse benign traffic; most algorithms adapt to attack traffic as the baseline for normal behavior and rank benign traffic as anomalous. An important exception is the Sparse Data Observers (SDO) algorithm, which successfully detects malicious traffic because it leverages a fully benign (semi-supervised) training phase to learn normality before anomalies are introduced. This shows that our datasets, despite being analytically demanding, are potentially solvable, presenting an attractive and necessary challenge for research and training in network security. In summary, our framework is capable of generating diverse, clearly attributable network traffic, which is useful for investigating and explaining failure modes of ML-based detection approaches in cybersecurity research. This addresses a known gap in the technical-scientific community that has been repeatedly identified by experts.
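
The failure mode described above can be reproduced on toy data. The sketch below is not SDO and not any algorithm evaluated in the thesis. It assumes a single synthetic flow feature, a simple running z-score detector (the hypothetical `RunningBaseline` class), and an 81% attack share, and it compares a detector that adapts to everything it sees with one that is fitted on benign samples only and then frozen.

```python
import random


class RunningBaseline:
    """Running mean/variance (Welford's method) as a toy model of 'normal' traffic."""

    def __init__(self, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.warmup = warmup

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score(self, x):
        # Return a neutral score until a few samples have been seen.
        if self.n < self.warmup:
            return 0.0
        std = (self.m2 / self.n) ** 0.5
        return abs(x - self.mean) / std if std > 0 else 0.0


def mean_score(scores, labels, wanted):
    picked = [s for s, y in zip(scores, labels) if y == wanted]
    return sum(picked) / len(picked)


rng = random.Random(0)
# Synthetic 1-D feature (assumption): benign near 0, attack near 5, 81% attack flows.
stream = [(rng.gauss(0.0, 1.0), "benign") for _ in range(190)]
stream += [(rng.gauss(5.0, 1.0), "malicious") for _ in range(810)]
rng.shuffle(stream)
labels = [y for _, y in stream]

# Fully unsupervised: score each flow, then adapt the baseline to it.
unsup = RunningBaseline()
unsup_scores = []
for x, _ in stream:
    unsup_scores.append(unsup.score(x))
    unsup.update(x)

# Benign-only training phase; the baseline is frozen afterwards.
semi = RunningBaseline()
for _ in range(200):
    semi.update(rng.gauss(0.0, 1.0))
semi_scores = [semi.score(x) for x, _ in stream]

unsup_benign = mean_score(unsup_scores, labels, "benign")
unsup_attack = mean_score(unsup_scores, labels, "malicious")
semi_benign = mean_score(semi_scores, labels, "benign")
semi_attack = mean_score(semi_scores, labels, "malicious")
print(f"unsupervised:   benign={unsup_benign:.2f} attack={unsup_attack:.2f}")
print(f"benign-trained: benign={semi_benign:.2f} attack={semi_attack:.2f}")
```

In this toy setting the adaptive detector's baseline drifts toward the dominant attack traffic, so benign flows receive the higher average anomaly score, while the benign-trained detector ranks attack flows as more anomalous. Real detectors and flow features are far richer, so the sketch only conveys the mechanism, not the thesis's measured results.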

Authors

Denis Vystaukin

Institutions

TU Wien
