Abstract
3D human pose estimation (HPE) is a cornerstone task in computer vision with diverse applications, and lifting 2D pose sequences to 3D representations has attracted significant interest. Transformer-based approaches have demonstrated robust performance but are hampered by quadratic computational complexity and insufficient bidirectional modeling capabilities. The recently introduced Mamba model mitigates these limitations through state-space models (SSMs), which offer linear complexity and effective modeling of long-range dependencies; however, it falls short in modeling the local skeletal interactions essential for human motion. To address this, we present BSTMamba, a bidirectional spatiotemporal SSM architecture designed specifically for monocular 3D HPE. BSTMamba integrates efficient global sequence modeling with localized convolutions and dynamic gating mechanisms to capture intricate spatiotemporal dependencies. For enhanced robustness and generalization, we introduce DisruptEnhance, a residual-compensated joint-order perturbation module that randomly disrupts joint orders at both global (full-skeleton) and local (body-part) scales, followed by feature compensation via a lightweight residual subnet. Comprehensive evaluations on the Human3.6M and MPI-INF-3DHP datasets show that BSTMamba attains state-of-the-art accuracy while requiring fewer parameters and fewer multiply-accumulate operations (MACs) than prior methods.
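The joint-order perturbation described above can be illustrated with a minimal sketch. It covers only the shuffling step at the global and local scales and is not the authors' implementation: the 17-joint layout, the `BODY_PARTS` grouping, and all function names are assumptions made for illustration, and the residual compensation subnet is not modeled.

```python
import numpy as np

# Hypothetical body-part grouping for a 17-joint skeleton.
# The abstract does not specify the grouping; this one is an assumption.
BODY_PARTS = {
    "torso": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}


def global_permutation(num_joints, rng):
    """Global scale: shuffle the order of all joints in the skeleton."""
    return rng.permutation(num_joints)


def local_permutation(num_joints, parts, rng):
    """Local scale: shuffle joints only within each body part."""
    perm = np.arange(num_joints)
    for indices in parts.values():
        indices = np.asarray(indices)
        perm[indices] = rng.permutation(indices)
    return perm


def perturb_joint_order(pose_seq, perm):
    """Reorder the joint axis of a (frames, joints, channels) array."""
    return pose_seq[:, perm, :]


def restore_joint_order(pose_seq, perm):
    """Undo perturb_joint_order by applying the inverse permutation."""
    inverse = np.argsort(perm)
    return pose_seq[:, inverse, :]
```

A caller would draw a permutation per training sample, for example `perm = local_permutation(17, BODY_PARTS, np.random.default_rng(0))`, apply `perturb_joint_order` to the 2D input sequence, and use `restore_joint_order` wherever the original joint order is needed again.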
Chuhan Wu
University of Technology Sydney
Zan Wang
Hebei Medical University
Gengze Zhou
Australian Institute of Business
Henan University
Wu et al. (Thu,) studied this question.
synapsesocial.com/papers/68c189d29b7b07f3a061337d — DOI: https://doi.org/10.21203/rs.3.rs-7477209/v1