March 18, 2024 | Open Access

Modal Consensus and Contextual Separation for Weakly Supervised Temporal Action Localization

Abstract

Weakly supervised Temporal Action Localization (W-TAL) is a challenging task that aims to both classify actions and localize their temporal boundaries while training only on video-level labels. Recent methods rely on simple cascading or integration of appearance and optical flow features, which often leads to incomplete action localization and ambiguity between foreground and background. This paper introduces the Modal Consensus and Context Separation (MCCS) approach to address these problems. First, a modal collaboration module enhances the action feature representation by combining appearance and optical flow features while discarding redundant elements that would otherwise degrade results. Next, a spatiotemporal self-attention module fuses the two enhanced streams, capturing the spatial and temporal relationships among action snippets. In addition, a hybrid modeling mechanism separates foreground from background by focusing on local action attributes within the hybrid features. Experiments on the THUMOS14 and ActivityNet1.3 datasets substantiate the effectiveness of MCCS and demonstrate its superiority on W-TAL.
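
To make the weak-supervision setting concrete, the following is a minimal sketch of a multiple-instance-learning baseline commonly used in W-TAL. It is not the MCCS method. Per-snippet class scores are pooled with a top-k mean into video-level scores (which is what video-level labels can supervise), and actions are localized by thresholding the per-snippet scores. The function names, the toy scores, and the values of `k` and `threshold` are illustrative assumptions, not taken from the paper.

```python
def video_level_scores(cas, k):
    """Top-k mean pooling over time.

    cas is a T x C nested list of per-snippet class scores
    (hypothetical values here, not produced by any real model).
    """
    num_classes = len(cas[0])
    scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in cas), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores


def localize(cas, class_idx, threshold):
    """Return [start, end) snippet ranges where the class score exceeds threshold."""
    segments = []
    start = None
    for t, row in enumerate(cas):
        active = row[class_idx] > threshold
        if active and start is None:
            start = t  # a new segment begins
        elif not active and start is not None:
            segments.append((start, t))  # the current segment ends
            start = None
    if start is not None:
        segments.append((start, len(cas)))  # segment runs to the last snippet
    return segments


# Toy scores: 6 snippets, 2 action classes (assumed values).
cas = [
    [0.1, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.1],
    [0.7, 0.0],
    [0.2, 0.1],
]

scores = video_level_scores(cas, k=2)
segments = localize(cas, class_idx=0, threshold=0.5)
print(scores)
print(segments)
```

In this toy run, class 0 is active on snippets 1 to 2 and on snippet 4. A fixed threshold of this kind tends to split one action into fragments and to pick up background context, which is the incompleteness and foreground-background ambiguity the abstract refers to.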
