Weakly-supervised Temporal Action Localization (WTAL) aims to localize and classify action instances in long untrimmed videos using only video-level annotations. Most existing WTAL methods rely on pre-trained feature extractors to obtain RGB and optical flow features, which reduces computational cost. However, this strategy suffers from two critical limitations: (1) limited temporal receptive fields, resulting in inadequate exploitation of contextual information; and (2) interference from irrelevant background content, which degrades overall performance. To address these issues, we propose a Feature-Enhanced Network (FE-Net), which comprises three key components: the Local Feature Expansion and Enhancement Module (LF-EEM), the Cross-modal Fusion Enhancement Module (CEM), and the Cross-temporal Gated Feature Fusion Module (CGFF). Specifically, LF-EEM expands the temporal receptive field to better capture complete action instances. CEM exploits the complementary nature of the auxiliary and primary modalities, using cross-modal fusion to suppress background noise in the primary modality. CGFF replaces simple concatenation during feature fusion with a cross-temporal gating mechanism that emphasizes salient changes across time. Extensive experiments on two large-scale benchmark datasets, THUMOS-14 and ActivityNet v1.2, demonstrate that FE-Net significantly improves the performance of existing WTAL methods. These results validate the effectiveness of the proposed modules and provide new insights for advancing temporal action localization under weak supervision.
Zhang et al. (Sat,) studied this question.