Recent advances in action segmentation have improved the analysis of complex, dynamic scenes in video. Two challenges persist, however: model efficiency and the substantial cost of manual annotation. In this work, we introduce a framework that performs active learning in hyperbolic space to address both. Hyperbolic space is naturally suited to modeling hierarchically structured data; by combining this representational capacity with the selectivity of active learning, our method defines hyperbolic uncertainty metrics that guide the selection of the most informative video frames and sequences for annotation, so that annotation effort is spent where it has the greatest impact. The model then iteratively refines pseudo labels using all available annotations, significantly reducing the need for exhaustive labeling while preserving high segmentation accuracy. To further reduce reliance on precise annotations, we extend the MS-TCN model with soft pseudo labels and a weighting mechanism that dynamically adjusts the learning signal according to label confidence, which makes training more robust to noisy or weakly labeled data. Extensive experiments on two widely used action segmentation benchmark datasets show that our approach substantially reduces annotation effort while maintaining segmentation performance.
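As a rough illustration of the two mechanisms named above, the following sketch shows one way a hyperbolic uncertainty score and a confidence-weighted soft-label loss could be written. It is not the implementation used in this work. It assumes frame embeddings lie in the unit Poincaré ball (curvature -1) and that a small distance from the origin signals high uncertainty, a common convention that the text above does not confirm. The function names `poincare_radius`, `select_frames`, and `weighted_soft_cross_entropy` are hypothetical.

```python
import math


def poincare_radius(x):
    """Hyperbolic distance from the origin of the unit Poincare ball (curvature -1)."""
    norm = math.sqrt(sum(v * v for v in x))
    norm = min(norm, 1.0 - 1e-7)  # keep the point strictly inside the ball
    return 2.0 * math.atanh(norm)


def select_frames(embeddings, budget):
    """Return indices of the `budget` frames closest to the origin.

    Assumption: a small hyperbolic radius means low confidence, so these
    frames are treated as the most informative ones to annotate.
    """
    order = sorted(range(len(embeddings)),
                   key=lambda i: poincare_radius(embeddings[i]))
    return order[:budget]


def weighted_soft_cross_entropy(logits, soft_labels, weights):
    """Cross-entropy against soft targets, weighted per frame by label confidence.

    logits, soft_labels: one list of per-class values for each frame.
    weights: one non-negative confidence weight for each frame.
    The result is normalised by the total weight.
    """
    total, weight_sum = 0.0, 0.0
    for z, q, w in zip(logits, soft_labels, weights):
        m = max(z)
        # Numerically stable log of the softmax normaliser.
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        ce = -sum(qk * (zk - log_norm) for qk, zk in zip(q, z))
        total += w * ce
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

In an active learning round under these assumptions, `select_frames` would choose which frames to send for annotation, and the pseudo labels of the remaining frames would enter training through `weighted_soft_cross_entropy`, with lower weights for less confident labels.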
Xiu et al. (Mon,) studied this question.