What type of study is this?

This is an experimental study.

October 22, 2025 | Open Access

MSFDnet: A Multi-Scale Feature Dual-Layer Fusion Model for Sound Event Localization and Detection

Key points

MSDFnet achieves improved sound event localization and detection accuracy in complex audio scenarios.
Achieves a SELDscore of 0.198 (lower is better) on the DCASE2020 Task 3 dataset.
Integrates multi-scale feature aggregation with dual-layer feature fusion to better capture spatial features.
Results validate the effectiveness of MSFA and DLFF modules in advancing sound event detection and localization tasks.

Abstract

Sound Event Localization and Detection (SELD) aims to recognize sound events and localize them in space simultaneously. However, existing SELD methods are limited in long-duration, dynamic audio scenarios: they do not fully exploit the complementarity between multi-task features, and their feature extraction lacks depth, which restricts system performance. To address these issues, we propose a novel SELD model, MSDFnet. By introducing a Multi-Scale Feature Aggregation (MSFA) module and a Dual-Layer Feature Fusion (DLFF) strategy, MSDFnet captures rich spatial features at multiple scales and establishes a stronger complementary relationship between sound event detection (SED) and direction-of-arrival (DOA) features, thereby improving detection and localization accuracy. On the DCASE2020 Task 3 dataset, our model achieved 0.319, 76%, 10.2°, 82.4%, and 0.198 on the ER20, F20, LEcd, LRcd, and SELDscore metrics, respectively. These results indicate that MSDFnet performs well in complex audio scenarios, and ablation studies confirm the effectiveness of the MSFA and DLFF modules for the SELD task.
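
The reported SELDscore can be cross-checked against the other four metrics. The sketch below assumes the standard DCASE2020 Task 3 aggregate definition, which the abstract itself does not spell out: the mean of the error rate, one minus the F-score, the localization error normalized by 180°, and one minus the localization recall. The function name `seld_score` is hypothetical and used only for illustration.

```python
def seld_score(er20: float, f20: float, le_cd_deg: float, lr_cd: float) -> float:
    """Aggregate SELD score (lower is better).

    Assumes the DCASE2020 Task 3 definition: the mean of four error-like
    terms, each scaled to lie roughly in [0, 1].
    """
    return (
        er20                    # location-dependent error rate (20° threshold)
        + (1.0 - f20)           # F-score converted to an error term
        + le_cd_deg / 180.0     # localization error in degrees, normalized
        + (1.0 - lr_cd)         # localization recall converted to an error term
    ) / 4.0


# Values reported in the abstract; percentages are expressed as fractions.
score = seld_score(er20=0.319, f20=0.76, le_cd_deg=10.2, lr_cd=0.824)
print(round(score, 3))  # 0.198
```

Under this assumed definition, the four component metrics reproduce the reported SELDscore of 0.198 after rounding, so the figures in the abstract are mutually consistent.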
