Action Quality Assessment (AQA) has recently gained prominence as a crucial area of research, driven by advancements in deep learning and increasing demands for automated, objective evaluations of human action execution. Unlike traditional action recognition tasks, AQA involves a nuanced analysis aimed at quantifying performance quality, thereby enabling practical applications across diverse domains such as sports evaluation, rehabilitation monitoring, and professional skill assessment. This survey systematically reviews deep learning-based methodologies developed for video-based AQA, categorizing existing approaches according to data modalities, learning paradigms, and evaluation granularity. Specifically, the review covers single-modality and emerging multi-modal approaches integrating visual, auditory, and textual information, as well as supervised, self-supervised, and contrastive learning frameworks. Key benchmark datasets utilized in AQA research are comprehensively analyzed with emphasis on their scope, representativeness, and annotation characteristics. Furthermore, critical challenges confronting the field-including limited model generalization, ambiguity in scoring annotations, and interpretability concerns-are identified and discussed. Potential future research directions that could address these limitations and further advance practical AQA deployment are also proposed.
Zhang et al. (Mon,) studied this question.