Accurately extracting smooth, temporally consistent 3D human motion from video remains difficult. Some existing techniques achieve favorable results by combining features from consecutive frames, but many of them sacrifice accuracy to reduce jitter or fail to fully capture the temporal nature of human movement. To address this, we model the natural smoothness of body motion by learning the long-range temporal relationship between the kinematic features of the human body in the video and the enhanced features of the current frame. First, we use the velocity and acceleration of key points as a temporal motion prior that effectively captures temporal features. Second, we design a module that uses a hierarchical attention mechanism to improve the representation of the current frame by selectively attending to important temporal information from both past and future frames, which strengthens the correlation between frames and improves the overall quality of the feature representation. Finally, the two sets of features are aggregated by a global motion-aware network and linearly fused to obtain the final, accurate 3D human motion.
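To make the temporal motion prior concrete, the following is a minimal sketch of how key point velocity and acceleration could be derived from a sequence of key point positions using backward finite differences. It is an illustration under stated assumptions, not the authors' implementation: the function name `kinematic_features`, the `(T, J, D)` array layout, the `fps` parameter, and the zero-padding of the first frames are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def kinematic_features(keypoints, fps=30.0):
    """Estimate per-key-point velocity and acceleration (hypothetical helper).

    keypoints: array of shape (T, J, D) holding T frames, J key points,
               and D coordinates per key point (assumed layout).
    fps:       frame rate used to convert frame differences to per-second
               rates (assumed parameter).
    Returns two arrays with the same shape as `keypoints`.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    dt = 1.0 / fps

    velocity = np.zeros_like(keypoints)
    acceleration = np.zeros_like(keypoints)

    # Backward difference: velocity at frame t uses frames t and t-1.
    # Frame 0 has no predecessor, so its velocity stays zero (assumption).
    velocity[1:] = (keypoints[1:] - keypoints[:-1]) / dt

    # Acceleration at frame t uses the velocities at frames t and t-1.
    # Frames 0 and 1 lack two valid velocities, so they stay zero.
    acceleration[2:] = (velocity[2:] - velocity[1:-1]) / dt

    return velocity, acceleration


if __name__ == "__main__":
    # One key point moving along one axis with position t**2.
    positions = (np.arange(5.0) ** 2).reshape(5, 1, 1)
    vel, acc = kinematic_features(positions, fps=1.0)
    print(vel[:, 0, 0])
    print(acc[:, 0, 0])
```

In a pipeline of the kind described above, arrays like these would be fed to a learned network alongside the per-frame image features; how the features are encoded and fused is not detailed in the abstract.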
Zhang et al. (Mon,) studied this question.