Backdoor attacks pose a serious challenge to the robust deployment of machine learning (ML) and deep learning (DL) models in safety- and mission-critical fields. In a backdoor attack, an adversary injects a hidden trigger so that the model behaves normally on clean inputs but consistently produces attacker-chosen outputs when the trigger is present. Existing defences generally operate at a single stage of the ML lifecycle (on the data, on the model, or at inference time) and are therefore susceptible to adaptive attackers who deliberately evade their underlying assumptions. This paper proposes Multi-Stage Backdoor Detection (MSBD), a defence-in-depth framework that combines mechanisms across training, post-training model inspection, and deployment-time monitoring. MSBD has four stages: influence-based screening of training samples (Stage A), optimisation-based trigger inversion (Stage B), neuron activation graph analysis to detect suspicious subnetworks (Stage C), and calibrated runtime detection that integrates trigger signatures with perturbation-based consistency checks (Stage D). The framework is designed to operate under realistic defender conditions, with limited access to both data and models, and to support both offline validation and online monitoring. We evaluate MSBD on three benchmark datasets (MNIST, CIFAR-10, GTSRB) under a strong BadNets-style backdoor attack and compare it against five representative defences (STRIP, Neural Cleanse, Activation Clustering, Spectral Signatures, and Fine-Pruning). Across all datasets, MSBD achieves an average F1-score of 99.0%, consistently higher than that of STRIP, with practical runtime overhead, showing that multi-stage, cross-layer defences can significantly improve robustness over single-stage defences.
Abroshan et al. (Thu,) studied this question.
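
To make the idea of a perturbation-based consistency check concrete, the following is a minimal sketch in the spirit of STRIP-style entropy testing. It is not the MSBD Stage D implementation, which the abstract does not specify. Everything here is an assumption for illustration: the toy 8x8 images, the hand-written `toy_model` standing in for a backdoored classifier, the 3x3 corner patch used as a BadNets-style trigger, and the helper names `make_image`, `stamp_trigger`, and `consistency_entropy`. The intuition is that superimposing clean samples on a clean input makes the prediction vary (higher entropy), whereas a triggered input keeps producing the attacker-chosen class (entropy near zero).

```python
import numpy as np

def make_image(label, side=8):
    """Toy clean sample: class 0 has a brighter left half, class 1 a brighter right half."""
    img = np.full((side, side), 0.1)
    if label == 0:
        img[:, : side // 2] = 0.4
    else:
        img[:, side // 2 :] = 0.4
    return img

def stamp_trigger(image, size=3, value=1.0):
    """Return a copy with a bright square patch in the bottom-right corner."""
    out = image.copy()
    out[-size:, -size:] = value
    return out

def toy_model(batch):
    """Hypothetical stand-in for a backdoored two-class classifier.

    Clean behaviour compares left and right brightness; the backdoor adds a
    large push towards class 0 when the corner patch is almost saturated.
    """
    half = batch.shape[2] // 2
    left = batch[:, :, :half].mean(axis=(1, 2))
    right = batch[:, :, half:].mean(axis=(1, 2))
    patch = batch[:, -3:, -3:].mean(axis=(1, 2))
    logit0 = 5.0 * (left - right) + 100.0 * np.maximum(patch - 0.9, 0.0)
    logits = np.stack([logit0, -logit0], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def consistency_entropy(x, clean_pool, model, n_overlays=20, rng=None):
    """Mean prediction entropy of `x` superimposed with random clean samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.integers(0, len(clean_pool), size=n_overlays)
    blended = np.clip(x[None, :, :] + clean_pool[idx], 0.0, 1.0)
    probs = model(blended)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    return float(entropy.mean())

# Small pool of clean samples from both classes (assumed to be available to the defender).
clean_pool = np.stack([make_image(i % 2) for i in range(20)])
clean_input = make_image(1)
triggered_input = stamp_trigger(clean_input)

h_clean = consistency_entropy(clean_input, clean_pool, toy_model)
h_trig = consistency_entropy(triggered_input, clean_pool, toy_model)
print(f"mean entropy, clean input:     {h_clean:.4f}")
print(f"mean entropy, triggered input: {h_trig:.4f}")
```

In a real deployment the decision threshold on this entropy would need to be calibrated on held-out clean data, and an adaptive attacker may design triggers that do not survive or do not dominate under superimposition, which is one motivation for combining such a check with other stages.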