Hand-object interactions are central to everyday activities, yet most intelligent assistants today remain blind to users' physical actions. Existing IMU-based recognition approaches focus on classifying predefined gestures, but they lack the semantic expressiveness required for contextual support in real-world scenarios such as office work and home routines. In this paper, we introduce a semantic tokenization pipeline that bridges continuous inertial signals and large language models (LLMs), enabling assistants to “read” hand movements as naturally as words. We first collect a multimodal dataset of dual-hand activities across office and home environments, capturing long-horizon action chains that span multiple interrelated sub-tasks. Using self-supervised representation learning, we discretize IMU embeddings into action tokens that approximate a vocabulary of hand interactions. These tokens are then aligned with natural language through instruction-tuned LLMs, supporting tasks such as action captioning, intent inference, and contextual feedback. Our evaluation shows that the tokenization improves semantic consistency with language distributions and that the LLM produces accurate, human-preferred descriptions of actions across diverse activities. We further demonstrate a proof-of-concept assistant prototype that generates contextual reminders. Our findings highlight the potential of transforming raw hand motions into a “language of actions,” paving the way for everyday intelligent assistants that are aware of users' physical interactions. The project page, source code, and dataset are publicly available at https://scut-hai.github.io/HMotionGPT/.
Gao et al. studied this question.
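The discretization step mentioned in the abstract, turning continuous IMU embeddings into discrete action tokens, can be illustrated with a minimal sketch. It assumes a nearest-neighbor lookup against a fixed codebook (a vector-quantization-style step); the abstract does not state which discretization method the authors use, and the names `tokenize`, `to_token_string`, the `<act_k>` token format, and the toy 2-D codebook are all hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(
    embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each embedding to the index of its nearest codebook entry.

    Assumption: the codebook is already learned; here it is hard-coded.
    """
    tokens = []
    for embedding in embeddings:
        distances = [squared_distance(embedding, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def to_token_string(tokens: Sequence[int]) -> str:
    """Render token indices as placeholder strings an LLM vocabulary could hold."""
    return " ".join(f"<act_{t}>" for t in tokens)


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Four toy embeddings, one per IMU window (hypothetical values).
embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.9, 0.1]]

tokens = tokenize(embeddings, codebook)
print(tokens)                   # [0, 1, 2, 1]
print(to_token_string(tokens))  # <act_0> <act_1> <act_2> <act_1>
```

In a setup like this, the resulting token string would be placed in the LLM prompt alongside natural-language instructions; how the actual system learns its codebook and aligns tokens with text is not specified here.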