Temporal action detection (TAD) aims to localize and recognize action instances in untrimmed videos, and serves as a key component of practical intelligent electronic systems such as smart video surveillance and real-time human–machine interaction. In these systems, accurate temporal localization is essential for reliable event understanding and downstream decision-making, particularly in edge computing and real-time streaming settings. To handle long video durations and diverse action dynamics, existing methods typically rely on hierarchical temporal feature integration to capture multi-scale contextual information. However, such integration often causes intra-segment inconsistency and boundary ambiguity, because indiscriminate temporal smoothing across scales degrades segment coherence and blurs critical boundary cues. In this work, we propose FreqAct, a multi-frequency feature fusion framework that explicitly models complementary low-frequency and high-frequency temporal components within hierarchical representations. Specifically, low-frequency modulation suppresses undesired temporal fluctuations to stabilize segment-level representations, while high-frequency enhancement preserves the boundary-sensitive cues needed for precise localization. Furthermore, we introduce a boundary-aware regression loss that emphasizes learning at action boundaries and an intra-segment consistency regularization that encourages coherent predictions within each action instance. Extensive experiments on THUMOS14 and ActivityNet-1.3 demonstrate that FreqAct outperforms state-of-the-art TAD methods, highlighting its effectiveness and practical potential for real-world electronics applications.
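To illustrate the intuition behind separating low-frequency and high-frequency temporal components, the following is a minimal sketch and not the FreqAct implementation. It assumes a moving-average filter as the low-pass operator and takes the residual as the high-frequency part. The function names `split_frequencies` and `fuse`, the window size, and the fusion weights are all hypothetical choices made for illustration only.

```python
def split_frequencies(signal, window=5):
    """Split a 1-D temporal signal into low- and high-frequency parts.

    Assumption (not from the paper): the low-frequency part is a centered
    moving average with edge-replicated padding, and the high-frequency
    part is the residual, signal - low.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not signal:
        return [], []
    half = window // 2
    # Replicate the first and last values so the output keeps the input length.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    low = [sum(padded[i:i + window]) / window for i in range(len(signal))]
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def fuse(low, high, low_weight=1.0, high_weight=1.5):
    """Recombine the two bands with hypothetical fixed weights.

    A high_weight above 1 amplifies the residual, which is largest
    near abrupt transitions such as action boundaries.
    """
    return [low_weight * l + high_weight * h for l, h in zip(low, high)]


# Toy per-frame feature: background, then an action segment, then background.
signal = [0.0] * 6 + [1.0] * 6 + [0.0] * 6
low, high = split_frequencies(signal, window=5)
fused = fuse(low, high)
```

In this toy example the residual vanishes in the interior of the action segment and is largest at its start and end, which is the property that motivates treating the two bands differently. A learned model would replace the fixed filter and weights with trainable components.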