November 30, 2025

A Collaborative Hierarchical Aggregation Network for Weakly-Supervised Temporal Action Localization

Abstract

Temporal action localization is a fundamental task in video understanding that focuses on classifying and temporally localizing action instances in untrimmed videos. Compared with its fully supervised counterpart, the Weakly-supervised Temporal Action Localization (WTAL) task is more challenging, because its training data lacks detailed information about action boundaries. Existing WTAL methods ignore the complementary relationship between modalities and the dependency between snippets, resulting in inaccurate localization results. To solve these issues, we propose a Collaborative Hierarchical Aggregation Network (CHA-Net). Specifically, we first use a modality complementary module to learn the synergies between modalities. Then a collaborative enhance module is proposed to remove action-irrelevant information from the RGB modality. Finally, a hierarchical aggregation module is proposed to capture the complete temporal information of action instances and better mine the temporal dependencies between snippets. Extensive experiments on the THUMOS14, ActivityNet1.2 and ActivityNet1.3 datasets demonstrate the effectiveness of our method. Compared with F3-Net (TMM2024, Avg0.1:0.5) and SPCC-Net (TMM2024, Avg0.1:0.7) on the THUMOS14 dataset, the proposed method achieves improvements of 3.2% and 2.4%, respectively.
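The abstract does not define the notation Avg0.1:0.5 and Avg0.1:0.7. A common reading in the temporal action localization literature, assumed here, is the mean of the mAP values measured at temporal IoU (tIoU) thresholds from 0.1 to 0.5 (or 0.7) in steps of 0.1. The minimal sketch below illustrates that reading only; the function names and the step size are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def threshold_range(lo, hi, step=0.1):
    """Thresholds lo, lo+step, ..., hi (step of 0.1 is an assumption)."""
    n = int(round((hi - lo) / step)) + 1
    return [round(lo + i * step, 2) for i in range(n)]


def average_map(map_at_threshold, lo, hi):
    """Mean of per-threshold mAP values over the range [lo, hi]."""
    thresholds = threshold_range(lo, hi)
    return sum(map_at_threshold[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # A predicted segment and a ground-truth segment that half overlap.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    # Thresholds behind "Avg0.1:0.5" and "Avg0.1:0.7" under this reading.
    print(threshold_range(0.1, 0.5))
    print(threshold_range(0.1, 0.7))
```

Under this reading, a prediction counts as correct at a given threshold when its tIoU with a same-class ground-truth instance reaches that threshold, so the two averages differ only in how many of the stricter thresholds they include.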

synapsesocial.com/papers/692b9d9a1d383f2b2a37a0ff https://doi.org/10.1145/3778170