Accurate identification of Chinese sensitive words is critical for maintaining online information security. However, the task faces three technical challenges: (1) strong contextual dependency, which causes semantic ambiguity; (2) adversarial variations (e.g., homophones, character splitting) that evade exact matching; and (3) a scarcity of high-quality annotated samples in complex scenarios, which makes the problem effectively few-shot. To address these challenges, we propose a Branch Transformer Mutual-Information Calibration (BTMC) network. Specifically: (i) to capture multi-level, cross-dimensional semantic interactions despite limited data, we design a branch-based Transformer structure that aligns and fuses features across different semantic dimensions; (ii) to establish context channels between global and local semantics under few-shot conditions, we introduce a global-local interactive fusion mechanism that sharpens the focus on core semantics; and (iii) to improve the discriminability of complex semantic patterns, we propose a semantic calibration regularization mechanism that reweights features and balances the information distribution. Experiments on a newly constructed Chinese sensitive-word dataset (45,623 sentences, four categories) show that BTMC achieves average F1-scores of 0.9715 (Politics and Violence), 0.9683 (Rudeness and Vulgarity), 0.9704 (Drugs and Gambling), and 0.9531 (Others), a relative improvement of 10–15% over state-of-the-art baselines. The code and dataset will be made publicly available.