What question did this study set out to answer?

This study asks whether reinforcement learning agents can do better than signature-based and anomaly-based intrusion detection by treating network defense as a sequential decision problem in a simulated enterprise network.

April 8, 2026 | Open Access

CyberShield: A Comparative Study of Reinforcement Learning Algorithms for Autonomous Cyber Threat Hunting in Simulated Enterprise Networks

Key Points

Developed CyberShield, a custom Gymnasium environment that simulates a 14-node enterprise network with four threat types (malware, backdoors, cryptominers, and data exfiltration).
Compared four reinforcement learning algorithms (PPO, DQN, A2C, and REINFORCE) across 48 hyperparameter configurations.
Gave the agent an 89-dimensional state vector as its observation and six possible actions.
Used a reward function that rewards correct threat neutralization and penalizes false positives and wasted time.
PPO with a 256, 256, 128 architecture achieved the highest mean reward (+14.30) and generalized to 5 out of 10 unseen test environments.
DQN reached +11.98 with far less training time (73 seconds).
Identified a failure mode in which the best-performing agent quarantines clean nodes because the quarantine action is irreversible; the paper discusses the implications for reversible action spaces and hierarchical RL designs.
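
The reward structure named in the key points (a bonus for correct neutralization, a penalty for false positives, a cost for wasted time) can be sketched roughly as below. This is an illustration only: the class `RewardWeights`, the function `step_reward`, and every numeric value are hypothetical placeholders, not the coefficients or code used in the paper.

```python
from dataclasses import dataclass


# Hypothetical weights: the paper's actual coefficients are not given here.
@dataclass(frozen=True)
class RewardWeights:
    neutralize: float = 10.0      # bonus for responding to a real threat
    false_positive: float = -2.0  # penalty for responding to a clean node
    step_cost: float = -0.1       # small cost charged on every step


def step_reward(acted: bool, node_infected: bool,
                w: RewardWeights = RewardWeights()) -> float:
    """Reward for one step: the time cost plus the outcome of any response."""
    reward = w.step_cost
    if acted:
        # A response either neutralizes a real threat or is a false positive.
        reward += w.neutralize if node_infected else w.false_positive
    return reward
```

Under these placeholder weights, idling next to an infected node costs only the step penalty, while a response on a clean node costs the false-positive penalty plus the step penalty. The relative size of those terms is what shapes how cautious or aggressive a learned policy becomes.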

Abstract

Most network intrusion detection systems on the market today work by matching traffic against databases of known attack signatures or by flagging statistical anomalies relative to some learned baseline of “normal” behavior. Neither approach handles novelty well: a zero-day exploit, a polymorphic payload, or an attacker who changes tactics mid-campaign can slip past both without triggering an alert until a human analyst happens to notice something off. This paper investigates whether reinforcement learning agents can do better by treating network defense as a sequential decision problem where the defender must patrol, scan, and respond to threats across a heterogeneous topology without knowing the attack signatures in advance. The paper presents CyberShield, a custom Gymnasium environment that simulates a 14-node enterprise network with four threat types (malware, backdoors, cryptominers, and data exfiltration), each with different damage rates, spreading behavior, and stealth characteristics. The agent observes an 89-dimensional state vector and chooses from six actions under a reward function that penalizes false positives, rewards correct threat neutralization, and punishes wasted time. Four algorithms (PPO, DQN, A2C, and REINFORCE) are compared across 48 hyperparameter configurations. PPO with a 256, 256, 128 architecture reaches the highest mean reward (+14.30) and generalizes to 5 out of 10 unseen test environments, while DQN reaches +11.98 with far less training time (73 seconds). The experiments also reveal a failure mode worth attention: the best-performing agent learns to quarantine clean nodes because the quarantine action is irreversible, and the agent rationally prefers a small certain penalty over the risk of unchecked infection spread. The implications for hierarchical RL designs and reversible action spaces are discussed. All code, trained models, and the deployment infrastructure are publicly available.
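
The quarantine failure mode described in the abstract comes down to an expected-cost comparison, which the sketch below makes concrete. It is a deliberately simplified model, not the paper's analysis: the function names, the linear damage model, and all numbers are assumptions chosen for illustration, and infection spread to neighboring nodes is left out (it would only raise the cost of waiting).

```python
def expected_cost_of_waiting(p_infected: float,
                             damage_per_step: float,
                             steps_undetected: int) -> float:
    """Expected damage from leaving a node alone (hypothetical linear model)."""
    return p_infected * damage_per_step * steps_undetected


def prefers_quarantine(quarantine_penalty: float,
                       p_infected: float,
                       damage_per_step: float,
                       steps_undetected: int) -> bool:
    """True when the certain penalty is smaller than the expected damage."""
    waiting = expected_cost_of_waiting(p_infected, damage_per_step,
                                       steps_undetected)
    return quarantine_penalty < waiting


# Hypothetical numbers: a fixed penalty of 2.0 for quarantining a clean node.
print(prefers_quarantine(2.0, p_infected=0.2, damage_per_step=1.5,
                         steps_undetected=10))
print(prefers_quarantine(2.0, p_infected=0.05, damage_per_step=1.5,
                         steps_undetected=10))
```

With these assumed values the first call favors quarantine and the second does not, so even a modest suspicion of infection is enough to make quarantining a probably clean node the lower-cost choice. A reversible quarantine would add a recovery path and change this trade-off, which is the design direction the abstract raises.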


Cite This Study

Wilsons Navid WADO TIWA (Mon,) studied this question.

synapsesocial.com/papers/69d5f0bb74eaea4b11a7a281 https://doi.org/10.5281/zenodo.19436536
