What question did this study set out to answer?

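This research aims to improve skeleton-based action recognition by addressing two limitations of existing methods: rigid anatomical graph topologies that cannot capture action-specific joint coordination, and local temporal receptive fields that miss long-range motion dependencies.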

April 24, 2026 | Open Access

A hybrid framework combining adaptive graph learning and global temporal attention for skeleton-based action recognition

Key Points

Developed a hybrid architecture combining adaptive graph learning and global Transformer attention.
Utilized embedded Gaussian adaptive graphs for discovering action-specific joint relationships.
Conducted extensive experimentation on NTU-RGB+D 60 and NTU-RGB+D 120 datasets.
Achieved 91.5% cross-subject accuracy on NTU-RGB+D 60, a statistically significant improvement (p < 0.001) over state-of-the-art GCN methods (CTR-GCN: +0.7%, MS-G3D: +2.1%) and Transformer baselines (ST-TR: +2.8%).
Ablation studies showed adaptive graphs contributing +5.8% and Transformers +5.3% to accuracy, with their combination yielding additional +2.5–3.0% gains.
Cross-dataset evaluation on NTU-RGB+D 120 confirmed generalization capability with an accuracy of 87.3%.
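
The key points name embedded Gaussian adaptive graphs but give no formula. As a rough illustration only, the sketch below uses the common embedded Gaussian form, in which the affinity between joints i and j is a softmax over j of the dot product of two learned embeddings. The projection matrices W_theta and W_phi, the toy sizes, and the function names are all assumptions made for this example and are not taken from the paper. Plain Python is used so the snippet runs without dependencies.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x (length cols)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embedded_gaussian_adjacency(X, W_theta, W_phi):
    """Return a row-normalized joint-to-joint affinity matrix.

    X: per-joint feature vectors (num_joints x in_dim).
    W_theta, W_phi: hypothetical learned projections (embed_dim x in_dim).
    Entry A[i][j] is softmax_j(theta(x_i) . phi(x_j)), so any pair of joints
    can be linked, whether or not a bone connects them.
    """
    theta = [matvec(W_theta, x) for x in X]
    phi = [matvec(W_phi, x) for x in X]
    A = []
    for t in theta:
        scores = [sum(a * b for a, b in zip(t, p)) for p in phi]
        A.append(softmax(scores))
    return A


def graph_aggregate(A, X):
    """Mix joint features with the adjacency: out[i] = sum_j A[i][j] * X[j]."""
    n, dim = len(X), len(X[0])
    return [[sum(A[i][j] * X[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)]


# Toy example with made-up sizes (assumption: 5 joints, 3-D input features).
random.seed(0)
num_joints, in_dim, embed_dim = 5, 3, 4
X = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(num_joints)]
W_theta = [[random.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(embed_dim)]
W_phi = [[random.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(embed_dim)]

A = embedded_gaussian_adjacency(X, W_theta, W_phi)
out = graph_aggregate(A, X)
```

Because the affinities are computed from the input features rather than from a fixed skeleton, the resulting graph can differ from sample to sample, which is the property the key points describe as action-specific.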

Abstract

Current skeleton-based action recognition methods face two critical bottlenecks: rigid anatomical graph topologies that cannot capture action-specific joint coordination, and local temporal receptive fields that miss long-range motion dependencies. We address these limitations through a hybrid architecture that synergistically combines data-driven adaptive graph learning with global Transformer attention. Our key innovation lies in embedded Gaussian adaptive graphs that discover non-physical joint relationships (e.g., contralateral limb coordination) through learned similarity functions, complemented by a Transformer encoder-decoder with learnable action queries that selectively extract discriminative temporal phases. Extensive experiments on NTU-RGB+D 60 demonstrate 91.5% cross-subject accuracy, achieving statistically significant improvements (p < 0.001) over state-of-the-art GCN methods (CTR-GCN: +0.7%, MS-G3D: +2.1%) and Transformer baselines (ST-TR: +2.8%). Ablation studies quantify synergistic effects: adaptive graphs contribute +5.8%, Transformers contribute +5.3%, and their combination yields additional +2.5–3.0% gains, validating complementary spatial-temporal modeling. Cross-dataset evaluation on NTU-RGB+D 120 (87.3% accuracy) confirms generalization capability. With 4.7M parameters and 230 ms/sample processing time, the model is suitable for offline biomechanics analysis and motion understanding applications.
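
To make the idea of learnable action queries more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention from a query vector to per-frame features. It is a simplification under stated assumptions: keys and values are the raw frame features with no learned projections, the query is hand-set rather than trained, and the name action_query_attention is hypothetical. The abstract does not specify the decoder at this level of detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_query_attention(queries, frames):
    """Cross-attention from action queries to frame features.

    queries: num_queries x dim (stand-ins for learnable parameters).
    frames:  num_frames x dim (stand-ins for encoder outputs per frame).
    Returns (outputs, weights), where weights[q][t] is how strongly
    query q attends to frame t, over the whole sequence at once.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in frames]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[t] * frames[t][d] for t in range(len(frames)))
                        for d in range(dim)])
    return outputs, weights


# Toy sequence (made-up values): frames 2 and 3 carry a distinct motion phase.
frames = [[0.0, 1.0], [0.0, 1.0], [3.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
queries = [[2.0, 0.0]]  # hand-set query aligned with the first feature dimension

outputs, weights = action_query_attention(queries, frames)
```

In this toy case the query places most of its attention on frames 2 and 3 regardless of where they sit in the sequence, which illustrates how a query can pick out a temporal phase without a fixed local window.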
