Action Quality Assessment (AQA) is a growing field in computer vision that focuses on objectively evaluating human actions from videos, with applications across various domains. Current approaches typically output a single overall score, which lacks the granularity needed for actionable performance feedback. This limitation is compounded by the scarcity of fine-grained annotations: while a few publicly available datasets contain sub-action temporal boundaries, none provide explicit sub-score labels. This paper introduces a framework that addresses these challenges by decomposing actions into interpretable sub-actions and using self-supervised learning to enhance feature representations. First, an unsupervised temporal segmentation module partitions a video into semantically meaningful sub-actions. Next, a self-supervised learning mechanism refines the initial spatio-temporal features, making them more robust to temporal irregularities and more discriminative of subtle motion differences. Finally, the refined features feed a progressive pseudo-subscore learning mechanism that explicitly models the sequential dependencies between sub-actions and generates fine-grained feedback that distinguishes short-range causal effects from cumulative long-range influences. The framework is validated through experiments on the UNLV-Diving and FineDiving datasets. It achieves state-of-the-art Spearman’s rank correlation (SRC) on both, which supports the claim that robust feature representations and explicit temporal modeling are crucial for accurate assessment.
Mazruei et al. (Fri,) studied this question.
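For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how Spearman's rank correlation between predicted and ground-truth quality scores can be computed. It is not taken from the paper: the function names and the example scores are hypothetical, and tied values are assumed to receive their average rank, which is the common convention.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rank_correlation(predicted, ground_truth):
    """Pearson correlation of the two rank vectors (handles ties)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("score lists must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two samples are required")

    rank_p = average_ranks(predicted)
    rank_g = average_ranks(ground_truth)
    n = len(rank_p)
    mean_p = sum(rank_p) / n
    mean_g = sum(rank_g) / n

    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(rank_p, rank_g))
    var_p = sum((a - mean_p) ** 2 for a in rank_p)
    var_g = sum((b - mean_g) ** 2 for b in rank_g)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / sqrt(var_p * var_g)


if __name__ == "__main__":
    # Hypothetical scores for five videos, for illustration only.
    predicted_scores = [62.1, 80.4, 45.0, 91.3, 70.2]
    judge_scores = [60.0, 85.5, 50.5, 88.0, 66.0]
    print(spearman_rank_correlation(predicted_scores, judge_scores))
```

Because SRC depends only on the ordering of the scores, a model can reach a high value even when its predictions are offset or rescaled relative to the judges' scores.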