Backdoor attacks pose a critical security risk to pre-trained language models (PLMs) by using concealed triggers to manipulate model outputs. Existing defense strategies largely depend on statistical thresholds, which often struggle to identify sophisticated backdoor samples that exhibit high cognitive similarity to benign data. Such similarity makes precise threshold calibration difficult and frequently leads to unreliable or failed detection. To overcome these limitations, we propose a backdoor detection method based on consensus deviation, shifting the defensive paradigm from surface-level statistical metrics to deep cognitive consensus verification. This approach removes the reliance on fixed thresholds and enables more robust identification of covert triggers. Extensive experiments on the SST-2, HSOL, and AG's News datasets show that our method achieves significantly lower attack success rates (ASRs) and greater robustness than existing baselines across word-, sentence-, and structural-level attack scenarios.
Yang et al. (Sat,) studied this question.