Multi-modal sensing has become crucial in Human Activity Recognition (HAR) because it combines data from diverse sensors. However, recognizing varied activities across different scenes, using multi-modal data from different positions and devices, remains challenging because of dynamic combinations of modal inputs, data heterogeneity, and the scarcity of labeled data. To tackle these challenges, we propose MASTER, a multi-modal foundation model designed specifically for HAR. MASTER introduces a self-supervised pre-training method based on masked data modeling, enabling the model to learn from unlabeled data and adapt to dynamic combinations of modal inputs. It also incorporates a few-shot alignment mechanism to facilitate adaptation to different activities, scenes, positions, and devices. Through pre-training and fine-tuning on 7 multi-modal HAR datasets, MASTER currently supports, but is not limited to, 8 modalities (ACC, Gyro, mmWave, WiFi, Skeleton, Lidar, Infrared, and RGB) and 45 human activities. The results demonstrate that MASTER achieves the highest accuracy with minimal labeled data across various situations, surpassing alternative solutions.
Guanzhou Zhu
Dong Zhao
C-Q. Li
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies
Beijing University of Posts and Telecommunications
Zhu et al. studied this question.
www.synapsesocial.com/papers/68c188579b7b07f3a0612470 — DOI: https://doi.org/10.1145/3749511