Fully supervised sequence recognition models require large-scale training data, which incurs substantial annotation costs. Pseudo-labeling, one of the most effective techniques in semi-supervised learning, uses predicted confidence to select pseudo-labels on unlabeled data for model training. However, recent studies indicate that overconfident models compromise semi-supervised learning, because unreliable confidence scores allow noisy samples to pass the filter and enter training. In this work, we find that overconfidence in sequence recognition models is influenced by the linguistic properties of a sequence: tail character classes are prone to being mispredicted, with high confidence, as head classes that appear frequently in the language. Moreover, this overconfidence intensifies continuously throughout semi-supervised training. To address this limitation, we propose a Dynamic Sequential Class-aware Smoothing (DSCS) method that calibrates overconfidence toward head classes, reducing the inaccurate pseudo-labels caused by overconfident mispredictions and thereby improving pseudo-label quality. Specifically, we design a sequential class-aware smoothing module that incorporates token class frequency information to regularize the model and prevent it from becoming overconfident toward head classes. Meanwhile, because overconfidence intensifies as semi-supervised learning proceeds, we introduce a dynamic regularization module that adjusts the calibration strength dynamically to coordinate calibration with the semi-supervised learning process. Extensive experiments demonstrate the effectiveness and generality of our method, which significantly reduces annotation effort while maintaining competitive recognition performance.
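To make the idea of frequency-aware smoothing with a time-varying strength concrete, the following is a minimal illustrative sketch, not the DSCS formulation itself, which the abstract does not specify. Everything in it is an assumption made for illustration: the function names (`class_aware_smoothing_targets`, `dynamic_strength`, `soft_cross_entropy`), the choice to scale the smoothing factor linearly with relative class frequency, the cap `eps_max`, and the linear ramp of the strength over training steps.

```python
import numpy as np


def class_aware_smoothing_targets(labels, class_freq, strength, eps_max=0.2):
    """Build smoothed one-hot targets, one row per token in the sequence.

    Assumption (hypothetical rule): the smoothing factor of a token is
    proportional to the relative frequency of its class, so head classes
    are smoothed the most and tail classes the least.
    """
    labels = np.asarray(labels)
    class_freq = np.asarray(class_freq, dtype=np.float64)
    num_classes = class_freq.shape[0]

    # Per-class smoothing factor in [0, eps_max * strength].
    eps_per_class = eps_max * strength * class_freq / class_freq.max()
    eps = eps_per_class[labels]  # shape (T,)

    # Spread eps uniformly over the non-target classes.
    targets = np.zeros((labels.shape[0], num_classes))
    targets += (eps / (num_classes - 1))[:, None]
    targets[np.arange(labels.shape[0]), labels] = 1.0 - eps
    return targets


def dynamic_strength(step, total_steps):
    """Assumed schedule: calibration strength ramps linearly from 0 to 1."""
    return min(1.0, max(0.0, step / total_steps))


def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


# Toy usage: three character classes with a long-tailed frequency profile.
class_freq = [900, 90, 10]           # hypothetical token counts
labels = [0, 2]                      # one head token, one tail token
strength = dynamic_strength(step=500, total_steps=1000)
targets = class_aware_smoothing_targets(labels, class_freq, strength)
loss = soft_cross_entropy([[4.0, 0.5, 0.1], [3.0, 0.2, 1.0]], targets)
```

In this sketch the head-class token receives a visibly softer target than the tail-class token, and the gap widens as the assumed schedule increases the strength, which mirrors the abstract's motivation that overconfidence grows during semi-supervised training.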
Xu et al. (Mon,) studied this question.