What question did this study set out to answer?

This study asks whether enhancing the RGB and optical flow features produced by pre-trained extractors can improve temporal action localization in untrimmed videos when only video-level annotations are available.

February 23, 2026 | Open Access

Weakly Supervised Temporal Action Localization Based on Feature Enhancement

Key Points

The authors propose a Feature-Enhanced Network (FE-Net) that refines the RGB and optical flow features obtained from pre-trained extractors.
The Local Feature Expansion and Enhancement Module (LF-EEM) expands the temporal receptive field so that complete action instances are captured.
The Cross-modal Fusion Enhancement Module (CEM) uses the auxiliary modality to suppress background noise in the primary modality.
The Cross-temporal Gated Feature Fusion Module (CGFF) replaces simple concatenation with a gating mechanism that emphasizes salient changes over time.
On the THUMOS-14 and ActivityNet v1.2 datasets, FE-Net significantly improved the performance of existing WTAL methods.
The proposed enhancements outperformed traditional feature extraction techniques.
Results indicate better action instance localization and classification accuracy, contributing to advancements in weakly supervised learning.
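
To make the idea of a wider temporal receptive field concrete, the sketch below counts how many consecutive video snippets a stack of stride-1 temporal convolutions can see. The summary does not say how LF-EEM expands the receptive field; stacked dilated convolutions are assumed here purely for illustration, and the function name and layer settings are hypothetical.

```python
def receptive_field(layers):
    """Receptive field, in snippets, of stacked stride-1 1D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer adds
    (kernel_size - 1) * dilation snippets of temporal context.
    """
    rf = 1  # a single snippet sees only itself
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Illustrative settings only (not taken from the paper).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])    # 7 snippets
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])  # 15 snippets
print(plain, dilated)
```

With the same number of layers and parameters, the dilated stack covers roughly twice the temporal context, which is the kind of gain a receptive-field expansion module is after.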

Abstract

Weakly-supervised Temporal Action Localization (WTAL) aims to accurately localize and classify action instances in untrimmed long videos using only video-level annotations. Although most existing WTAL methods leverage pre-trained feature extractors to obtain RGB and optical flow features, thereby reducing computational costs, this strategy suffers from two critical limitations: (1) limited temporal receptive fields, resulting in inadequate exploitation of contextual information; and (2) interference from irrelevant background content, which degrades overall performance. To address these issues, we propose a Feature-Enhanced Network (FE-Net), which comprises three key components: the Local Feature Expansion and Enhancement Module (LF-EEM), the Cross-modal Fusion Enhancement Module (CEM), and the Cross-temporal Gated Feature Fusion Module (CGFF). Specifically, LF-EEM expands the temporal receptive field to better capture complete action instances. CEM leverages the complementary nature of auxiliary and primary modalities to suppress background noise in the primary modality through cross-modal fusion. Furthermore, CGFF employs a cross-temporal gating mechanism during feature fusion to emphasize salient changes across time, replacing simple concatenation. Extensive experiments conducted on two large-scale benchmark datasets, THUMOS-14 and ActivityNet v1.2, demonstrate that FE-Net significantly enhances the performance of existing WTAL methods. These results validate the effectiveness of our proposed modules and provide new insights for advancing temporal action localization under weak supervision.
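
The contrast between gated fusion and simple concatenation can be sketched in a few lines. The abstract does not give the form of CGFF's gate, so everything below is an assumption for illustration: the function name `gated_fusion` is hypothetical, the gate is a scalar sigmoid of the mean absolute change in the flow features between consecutive snippets, and the fused feature is a convex combination of the two streams.

```python
import math


def gated_fusion(rgb, flow):
    """Fuse two feature sequences snippet by snippet with a scalar gate.

    rgb, flow: lists of T equal-length feature vectors (lists of floats).
    Assumed gate (not from the paper): sigmoid of the mean absolute change
    in the flow features since the previous snippet. No change gives a
    gate of 0.5; large change pushes the gate toward 1.
    """
    fused = []
    for t in range(len(rgb)):
        prev = max(t - 1, 0)  # the first snippet has no predecessor: change is 0
        change = sum(abs(a - b) for a, b in zip(flow[t], flow[prev]))
        change /= len(flow[t])
        gate = 1.0 / (1.0 + math.exp(-change))
        # Convex combination: output keeps the input dimension,
        # whereas concatenation would double it.
        fused.append([gate * f + (1.0 - gate) * r
                      for r, f in zip(rgb[t], flow[t])])
    return fused


# Toy input: static RGB features, flow features that jump at t = 1.
rgb = [[1.0, 1.0], [1.0, 1.0]]
flow = [[0.0, 0.0], [4.0, 4.0]]
out = gated_fusion(rgb, flow)
print(out)
```

In this toy input the first snippet is an equal mix of both streams, while the second is dominated by the flow features because they changed sharply, which is the behavior "emphasizing salient changes across time" suggests. A learned gate would replace the fixed sigmoid used here.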
