Research on audio-visual event localization has predominantly focused on manually trimmed short videos, each containing a single isolated event, which limits its applicability in real-world scenarios. In this paper, we tackle a more realistic and challenging problem known as dense audio-visual event localization (DAVEL). This task aims to detect and classify audio-visual events of varying durations that may overlap in time within untrimmed videos. Successfully addressing this task requires not only fine-grained perceptual capabilities but also a comprehensive understanding of cross-modal interactions and temporal dependencies among events. To this end, we propose a novel framework named Temporal-Aware Multimodal Event Network (TAME-Net). It consists of two key components: the Modality-Aware Temporal Alignment (MATA) module and the Temporal Interaction and Dependency Encoding (TIDE) module. The MATA module aligns and enhances modality-specific representations through semantic-level interactions and temporal consistency across modalities. The TIDE module captures sequential dependencies by modeling contextual relationships along the temporal axis, thereby improving the model's reasoning over complex event sequences. Extensive experiments on the large-scale UnAV-100 dataset demonstrate the effectiveness of the proposed framework, achieving state-of-the-art performance on the DAVEL task. We hope this work will inspire further research in dense audio-visual event localization.
Y. A. Han
Menglei Yang
Shenhao Zhang
Zhengzhou University
Han et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68d466be31b076d99fa65a7a — DOI: https://doi.org/10.1117/12.3082748