What question did this study set out to answer?

This research aims to improve the detection of Chinese sensitive words amidst contextual variations and limited data.

May 24, 2026 | Open Access

BTMC: Branch Transformer Mutual-Information Calibration Network for Chinese Sensitive-Word Detection with Few-Shot Learning

Key Points

Developed a Branch Transformer structure for semantic interaction across dimensions.
Introduced a global-local interactive fusion mechanism for contextual analysis under few-shot conditions.
Implemented semantic calibration regularization to reweight features and enhance discriminability.
Achieved average F1-scores of 0.9715 (Politics and Violence), 0.9683 (Rudeness and Vulgarity), 0.9704 (Drugs and Gambling), and 0.9531 (Others).
Reported a 10–15% relative improvement over state-of-the-art baselines.
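
For readers unfamiliar with the metrics, the following is a minimal sketch of how an F1-score and a relative improvement over a baseline are typically computed. The function names and all confusion-matrix counts are hypothetical and chosen only for illustration; they are not figures from the paper, and the paper's exact evaluation protocol (for example, how per-category averages are taken) is not described in this summary.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative (not absolute) gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Hypothetical counts for one category (NOT results from the paper).
model_f1 = f1_score(tp=95, fp=3, fn=4)
baseline_f1 = f1_score(tp=85, fp=12, fn=15)
gain = relative_improvement(model_f1, baseline_f1)

print(f"model F1:    {model_f1:.4f}")
print(f"baseline F1: {baseline_f1:.4f}")
print(f"relative improvement: {gain:.1%}")
```

Note that a relative improvement divides the gain by the baseline score, so it is larger than the absolute difference in F1 whenever the baseline is below 1.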

Abstract

Accurate identification of Chinese sensitive words is critical for maintaining online information security. However, this task faces three technical challenges: (1) high contextual dependency, which causes semantic ambiguity; (2) adversarial variations (e.g., homophones, character splitting) that evade exact matching; and (3) scarcity of high-quality annotated samples in complex scenarios, which gives the data a few-shot distribution. To address these challenges, we propose a Branch Transformer Mutual-Information Calibration (BTMC) network. Specifically: (i) to capture multi-level, cross-dimensional semantic interactions despite limited data, we design a branch-based Transformer structure that aligns and fuses features across different semantic dimensions; (ii) to establish context channels between global and local semantics under few-shot conditions, we introduce a global-local interactive fusion mechanism that enhances focus on core semantics; (iii) to improve the discriminability of complex semantic patterns, we propose a semantic calibration regularization mechanism that reweights features and balances information distribution. Experimental results on a newly constructed Chinese sensitive-word dataset (45,623 sentences, four categories) show that BTMC achieves average F1-scores of 0.9715 (Politics and Violence), 0.9683 (Rudeness and Vulgarity), 0.9704 (Drugs and Gambling), and 0.9531 (Others), a 10–15% relative improvement over state-of-the-art baselines. The code and dataset will be made publicly available.
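
To make challenge (2) concrete, here is a small sketch of why exact matching is brittle against adversarial variations. It is not part of BTMC. The blocklist, the function name `exact_match`, and the example strings are all hypothetical illustrations, not samples from the paper's dataset. The homophone variant swaps 赌 (dǔ) for the same-sounding 堵, and the character-splitting variant breaks 赌 into its components 贝 and 者.

```python
# Hypothetical one-word blocklist: 赌博 ("gambling").
BLOCKLIST = {"赌博"}


def exact_match(text: str) -> bool:
    """Flag text only if a blocklisted word appears verbatim as a substring."""
    return any(word in text for word in BLOCKLIST)


samples = {
    "original": "网上赌博",            # contains the blocklisted word verbatim
    "homophone": "网上堵博",           # 赌 replaced by same-sounding 堵
    "character splitting": "网上贝者博",  # 赌 split into components 贝 + 者
}

for label, text in samples.items():
    print(f"{label}: flagged={exact_match(text)}")
```

Only the unmodified sample is flagged; both variants pass the substring check, which is the gap that context-aware models such as the one proposed here aim to close.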
