Video moment retrieval (VMR) is a key cross-modal task with broad theoretical and practical applications. While fully supervised methods deliver strong performance, they are constrained by the high cost of temporal boundary annotations. Weakly supervised methods mitigate this issue but suffer from limited accuracy due to coarse supervision. Recently, point-supervised approaches that leverage single-frame annotations as a cost-effective alternative have emerged as a promising paradigm. However, these methods often fail to exploit the annotated frames for cross-modal semantic alignment. Additionally, they overlook global video structures and hierarchical segment relationships, leading to suboptimal retrieval accuracy under sparse supervision. To address these challenges, we propose the adaptive dual-stage tree construction (ADTC) model, a novel framework designed specifically for point-supervised VMR. First, the model introduces a dual-stage hypothesis tree architecture that integrates local and global trees, enabling the effective modeling of semantic relationships across multiple temporal scales. Second, it incorporates frame clustering and scene segmentation to extract the structural characteristics of video content, providing a foundation for comprehensive node relevance evaluation and an adaptive merging control strategy to optimize tree construction. Third, a hierarchical adaptive tree pruning strategy is implemented, combined with a novel proposal selection mechanism for distinguishing between positive and negative samples. These components are jointly optimized through a multilevel loss function, enabling enhanced semantic alignment and retrieval performance. The experimental results demonstrate that ADTC achieves state-of-the-art performance on the Charades-STA and ActivityNet Captions datasets under the point-supervised setting.
On Charades-STA, it reaches R@1 values of 50.28% at IoU=0.5 and 34.79% at IoU=0.7, surpassing other point-supervised methods. On ActivityNet Captions, ADTC achieves R@1 values of 65.02% at IoU=0.3 and 46.13% at IoU=0.5, setting new benchmarks. Notably, it outperforms fully supervised methods while significantly reducing annotation costs. Ablation studies confirm the effectiveness of each model component.
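For readers unfamiliar with the reported metric, the sketch below shows how R@1 at a given IoU threshold is conventionally computed for moment retrieval: the top-ranked predicted segment for each query counts as a hit when its temporal IoU with the ground-truth segment meets the threshold. This is an illustration under that assumed convention, not the authors' evaluation code; the function names `temporal_iou` and `recall_at_1` and the example segments are hypothetical.

```python
from typing import Sequence, Tuple

# A moment is a (start, end) pair in seconds; this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Segment], gts: Sequence[Segment],
                threshold: float) -> float:
    """Percentage of queries whose top-1 prediction has IoU >= threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Made-up top-1 predictions and ground truths for three queries.
example_preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 20.0)]
example_gts = [(3.0, 9.0), (4.0, 10.0), (10.0, 18.0)]

for m in (0.3, 0.5, 0.7):
    print(f"R@1, IoU={m}: {recall_at_1(example_preds, example_gts, m):.2f}%")
```

Some evaluation scripts use a strict inequality (IoU > threshold) instead; the two conventions differ only for predictions that land exactly on the threshold.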
Fang et al. studied this question.