Most network intrusion detection systems work either by matching traffic against databases of known attack signatures or by flagging statistical anomalies relative to a learned baseline of “normal” behavior. Neither approach handles novelty well: a zero-day exploit, a polymorphic payload, or an attacker who changes tactics mid-campaign can slip past both, raising no alert until a human analyst notices something amiss. This paper investigates whether reinforcement learning (RL) agents can do better by treating network defense as a sequential decision problem in which the defender must patrol, scan, and respond to threats across a heterogeneous topology without knowing the attack signatures in advance. The paper presents CyberShield, a custom Gymnasium environment that simulates a 14-node enterprise network with four threat types (malware, backdoors, cryptominers, and data exfiltration), each with its own damage rate, spreading behavior, and stealth characteristics. The agent observes an 89-dimensional state vector and chooses among six actions under a reward function that rewards correct threat neutralization and penalizes both false positives and wasted time. Four algorithms (PPO, DQN, A2C, and REINFORCE) are compared across 48 hyperparameter configurations. PPO with a 256-256-128 network architecture reaches the highest mean reward (+14.30) and generalizes to 5 of 10 unseen test environments, while DQN reaches +11.98 with far less training time (73 seconds). The experiments also reveal a failure mode that deserves attention: the best-performing agent learns to quarantine clean nodes. Because the quarantine action is irreversible, the agent rationally prefers a small, certain penalty over the risk of unchecked infection spread. The implications for hierarchical RL designs and reversible action spaces are discussed. All code, trained models, and the deployment infrastructure are publicly available.
Wilsons Navid WADO TIWA (Mon,) studied this question.
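The quarantine failure mode described above can be illustrated with a small expected-value comparison. The sketch below is not the CyberShield reward function: the class `RewardParams`, every numeric value in it, and the geometric spread model are hypothetical placeholders chosen only to show how a small, certain false-positive penalty can outweigh a low-probability but compounding infection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    """Hypothetical reward constants (assumed, not taken from the paper)."""
    false_positive_penalty: float = -1.0  # quarantining a clean node
    neutralize_reward: float = 5.0        # quarantining an infected node
    step_damage: float = -0.5             # damage per infected node per step
    spread_factor: float = 1.5            # assumed per-step growth of infections


def expected_return_quarantine(p_infected: float, params: RewardParams) -> float:
    """Expected return of quarantining a node that is infected with probability p."""
    return (p_infected * params.neutralize_reward
            + (1.0 - p_infected) * params.false_positive_penalty)


def expected_return_wait(p_infected: float, horizon: int, params: RewardParams) -> float:
    """Expected return of leaving the node alone for `horizon` steps.

    If the node is infected, the infection is assumed to grow geometrically
    and to cause damage on every step; if it is clean, nothing happens.
    """
    infected_nodes = 1.0
    damage = 0.0
    for _ in range(horizon):
        damage += params.step_damage * infected_nodes
        infected_nodes *= params.spread_factor
    return p_infected * damage


def prefers_quarantine(p_infected: float, horizon: int,
                       params: RewardParams = RewardParams()) -> bool:
    """True when quarantining has the higher expected return under this toy model."""
    return (expected_return_quarantine(p_infected, params)
            > expected_return_wait(p_infected, horizon, params))
```

Under these assumed constants, `prefers_quarantine(0.1, 5)` is true even though the node is clean nine times out of ten, whereas `prefers_quarantine(0.0, 5)` is false. A reversible quarantine action would lower the effective cost of a false positive, which is one way to read the paper's interest in reversible action spaces.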