This study explores unsupervised open-vocabulary skeleton action recognition, aiming to address the inaccurate spatial matching and poor interpretability of existing GCN models. We present Skeleton-DGCFA, an approach that performs feature alignment (FA) between the skeleton and image modalities, using a large pre-trained vision-language (VL) model together with our new diffusion graph convolutional (DGC) skeleton encoder. The DGC encoder comprises spatial and temporal convolutional modules, which allow different graph semantic features to diffuse across the skeleton graph. Skeleton-DGCFA harnesses recent large-scale VL models and extends their zero-shot capabilities to the skeleton modality by capitalizing on its natural pairing with images. Its open-vocabulary zero-shot performance improves with the strength of both the pre-trained VL model and the DGC skeleton encoder. We establish a new state of the art in zero-shot skeleton action recognition, surpassing the vanilla zero-shot method by 27.0% and 19.7% on NTU-60 and NTU-120, respectively.
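Once skeleton features are aligned with a VL model's embedding space, zero-shot recognition is commonly done by comparing a skeleton embedding against text embeddings of candidate action names and picking the closest one. The following is a minimal sketch of that matching step only, not the authors' implementation: the function names `cosine` and `zero_shot_classify`, the three-dimensional toy vectors, and the label set are all hypothetical, and the real embeddings would come from the DGC skeleton encoder and the VL text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(skeleton_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the skeleton embedding.

    Assumption: both embeddings already live in the same aligned space.
    """
    scores = {
        label: cosine(skeleton_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy stand-ins for VL text embeddings of action names (hypothetical values).
text_embeddings = {
    "waving": [1.0, 0.0, 0.0],
    "sitting down": [0.0, 1.0, 0.0],
    "jumping": [0.0, 0.0, 1.0],
}

# Toy stand-in for the output of an aligned skeleton encoder (hypothetical values).
skeleton_embedding = [0.1, 0.9, 0.2]

label, scores = zero_shot_classify(skeleton_embedding, text_embeddings)
print(label)  # prints "sitting down"
```

Because the label set is just a dictionary of text embeddings, new action names can be added at inference time without retraining, which is what makes this style of matching open-vocabulary.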
Wei et al. studied this question.