Temporal action localization is a fundamental task in video understanding that focuses on classifying and temporally localizing action instances in untrimmed videos. The Weakly-supervised Temporal Action Localization (WTAL) task is more challenging than standard temporal action localization because its training data lacks detailed information about action boundaries. Existing WTAL methods ignore the complementary relationship between modalities and the dependency between snippets, which leads to inaccurate localization results. To address these issues, we propose a Collaborative Hierarchical Aggregation Network (CHA-Net). Specifically, we first use a modality complementary module to learn the synergies between modalities. A collaborative enhance module then removes action-irrelevant information from the RGB modality. Finally, a hierarchical aggregation module captures the complete temporal information of action instances to better mine the temporal dependencies between snippets. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of our method. Compared with F3-Net (TMM 2024, Avg 0.1:0.5) and SPCC-Net (TMM 2024, Avg 0.1:0.7) on the THUMOS14 dataset, the proposed method achieves improvements of 3.2% and 2.4%, respectively.