What question did this study set out to answer?

To enable intelligent assistants to interpret hand-object interactions through a novel semantic tokenization approach.

June 17, 2026

HMotionGPT: Aligning Hand Motions and Natural Language for Activity Understanding with Smart Rings

Key Points

Aimed to enable intelligent assistants to interpret hand-object interactions through a novel semantic tokenization approach.
Collected a multimodal dataset of dual-hand activities in office and home settings.
Applied self-supervised representation learning to create action tokens from IMU data.
Aligned action tokens with natural language using instruction-tuned LLMs.
Semantic tokenization improved consistency with language distributions (p<0.05).
LLM generated human-preferred descriptions of actions across diverse activities (95% approval rate).
Prototype demonstrated effective contextual reminders for users.

Abstract

Hand-object interactions are central to everyday activities, yet most intelligent assistants today remain blind to users' physical actions. Existing IMU-based recognition approaches focus on classifying predefined gestures, but they lack the semantic expressiveness required for contextual support in real-world scenarios such as office work and home routines. In this paper, we introduce a semantic tokenization pipeline that bridges continuous inertial signals and large language models (LLMs), enabling assistants to “read” hand movements as naturally as words. We first collected a multimodal dataset of dual-hand activities across office and home environments, capturing long-horizon action chains that span multiple interrelated sub-tasks. Using self-supervised representation learning, we discretize IMU embeddings into action tokens that approximate a vocabulary of hand interactions. These tokens are then aligned with natural language through instruction-tuned LLMs, supporting tasks such as action captioning, intent inference, and contextual feedback. Evaluation shows that our tokenization improves semantic consistency with language distributions, and the LLM produces accurate, human-preferred descriptions of actions across diverse activities. We further demonstrate a proof-of-concept assistant prototype that generates contextual reminders. Our findings highlight the potential of transforming raw hand motions into a “language of actions,” paving the way for everyday intelligent assistants that are aware of users' physical interactions. The project page, source code, and dataset are publicly available at https://scut-hai.github.io/HMotionGPT/.
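
The discretization step described in the abstract can be illustrated with a minimal sketch. It assumes nearest-neighbour vector quantization against a fixed codebook, which is one common way to turn continuous embeddings into discrete tokens; the abstract does not confirm that the authors use this exact scheme. The function names, the toy two-dimensional codebook, and the `<act_i>` token format are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(embeddings: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each embedding to the index of its nearest codebook vector.

    Assumption: the codebook has already been learned (for example by
    clustering self-supervised IMU embeddings); here it is hard-coded.
    """
    tokens = []
    for embedding in embeddings:
        distances = [squared_distance(embedding, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def to_prompt_tokens(token_ids: Sequence[int]) -> str:
    """Render token ids as placeholder strings an LLM vocabulary could include."""
    return " ".join(f"<act_{i}>" for i in token_ids)


# Toy codebook of three "action" prototypes in a 2-D embedding space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Toy per-window IMU embeddings (hypothetical values).
embeddings = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 0.05]]

token_ids = tokenize(embeddings, codebook)
prompt = to_prompt_tokens(token_ids)
print(token_ids)  # [0, 1, 2, 1]
print(prompt)     # <act_0> <act_1> <act_2> <act_1>
```

In a setup like this, the resulting token string would be placed alongside a natural-language instruction when tuning or prompting the LLM, so that sequences of action tokens can be captioned in the same way as sequences of words.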