What question did this study set out to answer?

This research aims to enhance the detection of backdoor attacks in pre-trained language models through a new method called consensus deviation.

March 3, 2026 | Open Access

A Backdoor Label Verification Method Based on Consensus Deviation for Pre-Trained Language Models

Key Points

Developed a backdoor detection method based on cognitive consensus verification.
Shifted focus from surface-level metrics to deep cognitive analysis.
Conducted experiments on multiple datasets including SST-2 and AG's News.
Significantly reduced attack success rates compared to existing methods.
Enhanced robustness against various attack scenarios across different levels.
Achieved reliable detection without dependency on fixed statistical thresholds.

Abstract

Backdoor attacks pose a critical security risk to pre-trained language models (PLMs) by using concealed triggers to manipulate model outputs. Existing defense strategies largely depend on statistical thresholds, which often struggle to identify sophisticated backdoor samples that exhibit high cognitive similarity to benign data. Such similarities make precise threshold calibration difficult, frequently leading to unreliable or failed detection. To overcome these limitations, we propose a backdoor detection method based on consensus deviation, shifting the defensive paradigm from surface-level statistical metrics to deep cognitive consensus verification. This approach removes the reliance on fixed thresholds, enabling more robust identification of covert triggers. Extensive experiments on the SST-2, HSOL, and AG's News datasets showed that our method achieved significantly lower attack success rates (ASRs) and greater robustness than current baselines across word-, sentence-, and structural-level attack scenarios.
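For readers unfamiliar with the ASR metric named in the abstract, the sketch below shows one commonly used definition: the fraction of trigger-carrying test samples, drawn from classes other than the attacker's target, that the model assigns to the target label. This summary does not state the paper's exact evaluation protocol, so the function name `attack_success_rate`, its parameters, and the choice to exclude samples already belonging to the target class are illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence


def attack_success_rate(
    predictions: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Return the share of triggered, non-target samples predicted as the target.

    Assumption: every sample passed in already carries the backdoor trigger.
    Samples whose true label equals the target are skipped, because predicting
    the target for them says nothing about whether the attack succeeded.
    """
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")

    # Keep only predictions for samples that did not start in the target class.
    eligible = [p for p, y in zip(predictions, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no samples outside the target class to evaluate")

    hits = sum(1 for p in eligible if p == target_label)
    return hits / len(eligible)


# Hypothetical example: four triggered samples, attacker's target label is 1.
# Three of them truly belong to class 0, and two of those are flipped to 1.
example_asr = attack_success_rate([1, 1, 0, 1], [0, 0, 0, 1], target_label=1)
```

Under this definition a lower value after applying a defense indicates that fewer triggered inputs still reach the attacker's target label, which is the sense in which the abstract reports reduced ASRs.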

Cite This Study

Yang et al. studied this question.

synapsesocial.com/papers/69a67ed1f353c071a6f0a612 https://doi.org/10.3390/electronics15051015
