We propose an occlusion-aware framework for human pose estimation based on temporal point-cloud sequences. Training data are generated via simulation and augmented with synthetic occlusions using Perlin-noise masks. The network combines PointNet++ for spatial feature extraction, a Transformer for temporal encoding, and a graph convolutional network with inverse DCT for skeletal reconstruction. We evaluate the method against an RGB-based baseline (MediaPipe) under real robot-induced occlusions using 128 annotated frames of right-hand reaching. The proposed method achieves significantly lower errors than the baseline at the shoulder and elbow. An ablation study shows that occlusion augmentation significantly improves performance under occlusion. Visibility analysis further indicates that, after multiple-comparison correction, error-visibility correlations remain significant for the baseline but not for the proposed method, suggesting reduced sensitivity to occlusion. These results demonstrate the potential of simulation-to-real training for robust single-sensor pose estimation in assistive robotics.
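As an illustration of the occlusion augmentation step, the following is a minimal sketch of how a Perlin-noise mask might be used to remove points from a simulated point cloud. It is not the authors' implementation: the projection of points onto the x-y plane, the quantile-based threshold, and all names and parameters (`perlin_2d`, `occlude_point_cloud`, `occlusion_ratio`, `grid`, `cells`) are assumptions made for this example.

```python
import numpy as np


def perlin_2d(shape, cells, rng):
    """Return a 2D Perlin-noise array of `shape` with `cells` lattice cells per axis."""
    h, w = shape
    cy, cx = cells
    # Random unit gradient vectors at the lattice corners.
    angles = rng.uniform(0.0, 2.0 * np.pi, size=(cy + 1, cx + 1))
    grads = np.stack([np.cos(angles), np.sin(angles)], axis=-1)

    # Pixel coordinates expressed in lattice units.
    ys = np.linspace(0.0, cy, h, endpoint=False)
    xs = np.linspace(0.0, cx, w, endpoint=False)
    y, x = np.meshgrid(ys, xs, indexing="ij")
    y0, x0 = y.astype(int), x.astype(int)
    fy, fx = y - y0, x - x0

    def corner(iy, ix, dy, dx):
        # Dot product of the corner gradient with the offset to the pixel.
        g = grads[iy, ix]
        return g[..., 0] * dx + g[..., 1] * dy

    n00 = corner(y0, x0, fy, fx)
    n01 = corner(y0, x0 + 1, fy, fx - 1.0)
    n10 = corner(y0 + 1, x0, fy - 1.0, fx)
    n11 = corner(y0 + 1, x0 + 1, fy - 1.0, fx - 1.0)

    def fade(t):
        # Perlin's quintic smoothing curve.
        return t * t * t * (t * (t * 6.0 - 15.0) + 10.0)

    u, v = fade(fx), fade(fy)
    top = n00 * (1.0 - u) + n01 * u
    bottom = n10 * (1.0 - u) + n11 * u
    return top * (1.0 - v) + bottom * v


def occlude_point_cloud(points, occlusion_ratio=0.3, grid=64, cells=4, seed=0):
    """Drop roughly `occlusion_ratio` of the points using a Perlin-noise mask.

    Assumption: points are an (N, 3) array and the mask is applied to their
    x-y projection, so that occluded regions are spatially contiguous.
    """
    rng = np.random.default_rng(seed)
    noise = perlin_2d((grid, grid), (cells, cells), rng)

    # Map each point's x-y position to a cell of the noise grid.
    xy = points[:, :2]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.maximum(hi - lo, 1e-9)
    idx = np.clip(((xy - lo) / span * (grid - 1)).astype(int), 0, grid - 1)
    values = noise[idx[:, 1], idx[:, 0]]

    # Points whose noise value lies in the top `occlusion_ratio` are removed.
    threshold = np.quantile(values, 1.0 - occlusion_ratio)
    keep = values < threshold
    return points[keep], keep
```

Calling `occlude_point_cloud(points, occlusion_ratio=0.3)` on an (N, 3) array returns the surviving points and a boolean mask. Because Perlin noise varies smoothly, the removed points form connected patches rather than isolated dropouts, which is the property that makes it a plausible stand-in for occlusion by a robot arm.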
TAKASE et al. studied this question.