Machine learning-based network intrusion detection systems (NIDSs) remain vulnerable to adversarial manipulation, but the robustness literature for tabular NIDS data is still dominated by single-model, single-dataset, and non-adaptive evaluations. In this paper, we reposition the manuscript as a comparative robustness study of a four-component defense pipeline rather than as a claim of a universal defense primitive. We evaluate XGBoost, LightGBM, TabNet, and Residual MLP on RTIOT2022 and WebIDS23 under standard attacks, representative constrained/adaptive attacks, component-wise ablations, sample-fraction sensitivity, repeated-run significance tests, per-class F1 analysis, and computational-overhead measurements. The results show strong dataset and architecture dependence. On RTIOT2022, tree-based models close most of the robustness gap under strong attacks but often only after large clean-accuracy reductions; Residual MLP achieves a more favorable balance, while the full defense stack over-regularizes TabNet. On WebIDS23, aggregate robustness-gap reduction remains positive, yet simpler baselines such as adversarial-training-only or ensemble-only configurations frequently outperform the full four-stage pipeline in absolute clean/attack accuracy. Across both datasets, median filtering is the most fragile component: larger filter windows substantially degrade both clean and attacked accuracy, whereas contamination rate, anomaly-mixing weight, and ensemble size are comparatively stable. Representative constrained/adaptive evaluations reduce performance only modestly relative to standard FGSM/PGD, but per-class and overhead analyses show that minority-class collapse and training cost remain important deployment limitations. These findings support a more cautious conclusion: adversarial defense for tabular NIDS is validation driven and dataset specific, and the full defense stack should not be treated as a universal default.
Le et al. (Fri,) studied this question.