Human pose estimation (HPE) is a fundamental task in computer vision that aims to localize anatomical keypoints in images. Traditional methods rely on convolutional neural networks (CNNs), but recent Vision Transformer (ViT) models have shown superior performance. However, ViTs often require substantial computational resources. This paper introduces SPTPose, a method that combines self-distillation and token pruning to reduce computational cost while maintaining high accuracy. Our SPTPose-B achieves an mAP of 74.8% on the MSCOCO validation set with only 13.2 million parameters and 4.7 GFLOPs. The source code is available at https://github.com/duduxx123/SPTPose.
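To make the idea of token pruning concrete, the following is a minimal sketch of a generic score-based top-k pruning step, in which only the highest-scoring tokens are passed on to later transformer layers. It is not taken from the SPTPose code: the function name `prune_tokens`, the `keep_ratio` parameter, and the use of a single importance score per token are all assumptions made for illustration, and the method described here may select tokens differently.

```python
from typing import List, Sequence


def prune_tokens(tokens: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 keep_ratio: float) -> List[int]:
    """Return the indices (in original order) of the tokens to keep.

    Hypothetical helper: `scores` holds one importance value per token,
    for example an attention weight; how scores are obtained is assumed.
    """
    if len(tokens) != len(scores):
        raise ValueError("one score per token is required")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one token so later layers always have some input.
    n_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank token indices by score, highest first; the sort is stable,
    # so tied tokens keep their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])

    # Restore the original (spatial) order among the surviving tokens.
    return sorted(ranked[:n_keep])


# Example: four 2-dimensional tokens, half of which are kept.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
scores = [0.05, 0.60, 0.10, 0.25]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
pruned = [tokens[i] for i in kept]
```

Because self-attention cost grows quadratically with the number of tokens, dropping low-scoring tokens early reduces the work done by every subsequent layer, which is the general motivation for pruning in ViT-based models.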
ZHANG et al. (Sun,) studied this question.