December 8, 2025
Open Access

Network Traffic Data Augmentation Using WGAN Model Guided by LLM

Key Points

Jointly leveraging structural and semantic conditioning yields higher-fidelity synthetic traffic, enhancing classification outcomes.
Synthetic data generation addresses the class imbalance that degrades network traffic classification for IoT.
Employing a graph-conditioned Wasserstein GAN model with LLM guidance sets new standards for realistic data augmentation.
The results highlight the framework's potential for effective security analytics under data scarcity and privacy constraints.

Abstract

The Internet of Things (IoT) continues to expand across critical infrastructures, enabling automation, efficiency, and data-driven decision-making. Yet reliable device identification from network traffic remains hampered by severe class imbalance, which skews learning and degrades performance. Synthetic data generation offers a promising remedy, particularly in privacy-sensitive security settings where access to representative traffic is limited. This paper advances the state of the art by proposing a framework that unites graph-conditioned generative modeling with large language model (LLM) guidance to produce realistic, semantically valid synthetic network traffic for imbalanced classification. First, we construct feature relationship graphs derived from Pearson correlation, Spearman rank correlation, and mutual information to capture inter-feature dependencies, and use these graphs to condition a Wasserstein GAN (WGAN), thereby preserving structural properties of real traffic during generation. Second, we employ an LLM to define class-specific semantic constraints, including admissible feature ranges, attribute correlations, and protocol-level rules, which are enforced as soft guidance to steer the generator toward label-consistent and standards-compliant samples. Third, we institute a dual validation loop that combines LLM-based feedback on constraint satisfaction with evaluation of classifiers trained on datasets balanced by our method versus the traditional SMOTE technique. Lastly, extensive experiments demonstrate that jointly leveraging structural (graph) and semantic (LLM) conditioning yields higher-fidelity synthetic traffic and delivers consistent gains in macro-F1 and balanced accuracy for network traffic classification, highlighting the framework’s utility for security analytics under data scarcity and privacy constraints.
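
To make the first step more concrete, the following is a minimal sketch of how a feature relationship graph might be built from the three dependency measures named in the abstract. It is not the authors' implementation. The abstract does not say how mutual information is estimated or how edges are selected, so the equal-width binning, the absolute-value edge weights, the threshold of 0.8, and all names here (`feature_graph`, `toy`, the three toy feature columns) are assumptions chosen for illustration. Only the standard library is used.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def ranks(v):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def discretize(v, bins):
    """Equal-width binning (an assumed, simple estimator choice)."""
    lo, hi = min(v), max(v)
    if hi == lo:
        return [0] * len(v)
    width = (hi - lo) / bins
    return [min(int((a - lo) / width), bins - 1) for a in v]


def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information, in nats."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def feature_graph(columns, measure, threshold):
    """Return {(feature_a, feature_b): weight} for pairs whose absolute
    dependency score reaches the (assumed) threshold."""
    names = list(columns)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            w = abs(measure(columns[a], columns[b]))
            if w >= threshold:
                edges[(a, b)] = w
    return edges


# Hypothetical toy flow features, not taken from the paper's dataset.
toy = {
    "pkt_count": [1, 2, 3, 4, 5, 6, 7, 8],
    "byte_count": [100, 200, 300, 400, 500, 600, 700, 800],
    "mean_iat": [5, 1, 4, 2, 8, 3, 7, 6],
}
edges = feature_graph(toy, pearson, threshold=0.8)
print(edges)
```

Passing `spearman` or `mutual_information` as `measure` yields the other two graphs. Note that mutual information is not bounded by 1, so its threshold would need a different scale than the one used for the correlation coefficients. How the resulting graphs are encoded and fed to the WGAN as conditioning is not described in the abstract and is left out of this sketch.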
