March 3, 2026 | Open Access

Semantic-aware self-supervised learning using progressive sub-action regression for action quality assessment

Key Points

The proposed framework decomposes actions into interpretable sub-actions and generates fine-grained feedback for action quality assessment, rather than a single overall score.
The framework achieves state-of-the-art Spearman’s Rank Correlation (SRC) on the UNLV-Diving and FineDiving datasets.
An unsupervised temporal segmentation module partitions each video into semantically meaningful sub-actions.
The results indicate that robust feature representations and explicit temporal modeling are crucial for accurate action quality assessment.

Abstract

Action Quality Assessment (AQA) is a growing field in computer vision that focuses on objectively evaluating human actions from videos, with applications across various domains. Current approaches typically provide only a single overall score, which lacks the granular detail needed for actionable performance feedback. This limitation is compounded by the scarcity of fine-grained annotations: although a few publicly available datasets contain sub-action temporal boundaries, none provide explicit sub-score labels. This paper introduces a novel framework that addresses these challenges by decomposing actions into interpretable sub-actions and leveraging self-supervised learning to enhance feature representations. An unsupervised temporal segmentation module first partitions a video into semantically meaningful sub-actions. A self-supervised learning mechanism then refines the initial spatio-temporal features, making them more robust to temporal irregularities and more discriminative for subtle motion nuances. These refined features feed a progressive pseudo-subscore learning mechanism that explicitly models the sequential dependencies between sub-actions, generating fine-grained feedback that differentiates between short-range causal effects and cumulative long-range influences. The framework is validated through comprehensive experiments on the UNLV-Diving and FineDiving datasets. The results demonstrate state-of-the-art performance on the Spearman’s Rank Correlation (SRC) metric, confirming that robust feature representations and explicit temporal modeling are crucial for accurate assessment.
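
For readers unfamiliar with the SRC metric named in the abstract, the following is a minimal sketch of how Spearman's rank correlation between predicted and ground-truth scores can be computed. It is not the authors' evaluation code: the function names and the five example scores are hypothetical and chosen only for illustration. The sketch ranks both score lists (tied values share the mean of their positions) and then takes the Pearson correlation of the two rank vectors.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rank_correlation(predicted, ground_truth):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(ground_truth) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp = average_ranks(predicted)
    rg = average_ranks(ground_truth)
    mean_p = sum(rp) / len(rp)
    mean_g = sum(rg) / len(rg)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(rp, rg))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_g = sum((b - mean_g) ** 2 for b in rg)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_p * var_g)


# Hypothetical judge scores and model predictions for five dives
# (illustrative values only, not taken from either dataset).
judge_scores = [86.4, 62.1, 74.8, 91.0, 55.3]
model_scores = [83.0, 65.5, 61.0, 95.2, 50.1]
rho = spearman_rank_correlation(model_scores, judge_scores)
print(f"SRC = {rho:.3f}")
```

Because SRC depends only on the ordering of the scores, a model can reach a high value without matching the absolute score scale, so long as it ranks performances in nearly the same order as the judges.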
