What question did this study set out to answer?

The research aims to improve few-shot keypoint detection by modeling relations between keypoints within the object regions indicated by class-agnostic visual priors.

January 18, 2026 · Open Access

Exploiting Class-agnostic Visual Prior for Few-shot Keypoint Detection

Key Points

The research aims to improve few-shot keypoint detection by modeling relations between keypoints within the object regions indicated by class-agnostic visual priors.
Developed a few-shot keypoint detection approach utilizing class-agnostic visual priors.
Created a Visual Prior guided Vision Transformer (VPViT) incorporating refined visual priors.
Investigated transductive learning approaches to enhance keypoint representations with unlabeled data.
Implemented masking and alignment techniques to boost robustness against occlusions.
Demonstrated improved accuracy on seven public datasets for few-shot keypoint detection.
Significantly enhanced performance during transductive inference and in occluded conditions.

Abstract

Deep learning-based keypoint detectors can localize specific object (or body) parts well, but still fall short of general keypoint detection. Instead, few-shot keypoint detection (FSKD) is an underexplored yet more general task of localizing either base or novel keypoints, depending on the prompted support samples. In FSKD, how to build robust keypoint representations is the key to success. To this end, we propose an FSKD approach that models relations between keypoints. As keypoints are located on objects, we exploit a class-agnostic visual prior, i.e., the unsupervised saliency map or DINO attentiveness map, to obtain the region of focus within which we perform relation learning between object patches. The class-agnostic visual prior also helps suppress the background noise largely irrelevant to keypoint locations. Then, we propose a novel Visual Prior guided Vision Transformer (VPViT). The visual prior maps are refined by a bespoke morphology learner to include relevant context of objects. The masked self-attention of VPViT takes the adapted prior map as a soft mask to constrain the self-attention to foregrounds. As robust FSKD must also deal with the low number of support samples and occlusions, based on VPViT, we further investigate i) transductive FSKD to enhance keypoint representations with unlabeled data and ii) FSKD with masking and alignment (MAA) to improve robustness. We show that our model performs well on seven public datasets, and also significantly improves the accuracy in transductive inference and under occlusions. Source code is available at https://github.com/AlanLuSun/VPViT.
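The abstract does not say how the adapted prior map enters the self-attention, so the following is only a minimal sketch of one plausible reading, not the VPViT implementation. It assumes the soft mask is applied by adding the log of a per-patch foreground score to the attention logits, so that background patches receive little attention without being hard-masked. The function name `soft_masked_attention`, the `prior` argument, and the `eps` constant are hypothetical and chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_attention(queries, keys, values, prior, eps=1e-6):
    """Single-head attention with a soft foreground mask (illustrative only).

    queries, keys, values: lists of N patch vectors (lists of floats).
    prior: N foreground scores in [0, 1], one per key patch.
    ASSUMPTION: the soft mask is injected as log(prior + eps) added to the
    scaled dot-product logits; the paper may use a different formulation.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    bias = [math.log(p + eps) for p in prior]
    out_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) + bias[j]
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of value vectors, dominated by foreground patches.
        outputs.append(
            [sum(w[j] * values[j][d] for j in range(len(values)))
             for d in range(out_dim)]
        )
    return outputs, weights


# Three patches; the third is treated as background (prior score 0).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, attn = soft_masked_attention(patches, patches, patches, [1.0, 1.0, 0.0])
```

Under this assumption, a uniform prior leaves the attention unchanged, because softmax is invariant to a constant shift of the logits, while a prior near zero pushes a patch's attention weight toward zero.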
