Self-supervised learning (SSL) can capture intrinsic features from large amounts of unlabeled data, which significantly reduces dependence on labels, and it performs well in human activity recognition (HAR). However, existing SSL frameworks depend excessively on data augmentation paradigms and often mistakenly treat noise as a learning objective during mask reconstruction. Moreover, the scale of the available data set often constrains accuracy and hinders real-world applicability. To address these issues, this paper proposes a new SSL objective that integrates an attention mechanism with an adaptive time series mixer. Without relying on data augmentation, the proposed model assigns lower weights to noise while capturing global dependencies and extracting local features within inertial measurement unit (IMU) series. The proposed model was validated through comprehensive evaluations on three public data sets (UCI, Motion, and HHAR) and one self-collected data set (named CQJTU-FCE). The experimental results show that, on the self-collected data set, the proposed model achieves average improvements of 6.54%, 8.55%, and 7.88% in accuracy, F1 score, and Cohen's kappa coefficient, respectively, compared with the baseline models. Similarly, on the public data sets, the average improvements reach 10.63%, 11.77%, and 13.39% on the same evaluation metrics. These results support the generalizability of the model across data sets, offering a more efficient and reliable solution for HAR tasks.
Hou et al. (Sun,) studied this question.
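
The three evaluation metrics named above (accuracy, F1 score, and Cohen's kappa coefficient) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the toy activity labels are hypothetical, and macro averaging of the per-class F1 score is an assumption, since the text does not state which averaging scheme is used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected chance agreement from the two marginal label distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice made for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels, used only to exercise the functions.
y_true = ["walk", "walk", "sit", "sit", "run", "run"]
y_pred = ["walk", "sit", "sit", "sit", "run", "walk"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.4f}")
```

Cohen's kappa discounts the agreement expected by chance, so it is a useful complement to accuracy when activity classes are imbalanced.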