In complex real‐world environments such as disaster monitoring, effective sound event detection (SED) is often hindered by noise and scarce labeled data. This article presents Gate‐Align‐SED, a unified semi‐supervised framework designed to bridge the gap between clip‐level and frame‐level acoustic modeling for disaster‐related audio understanding. The proposed method integrates adaptive feature fusion, mutual attention mechanisms, and a novel label alignment strategy in which a learnable correlation matrix aligns labels of different granularity. We further incorporate consistency learning based on the Mean‐Teacher framework, which promotes robust representation learning across both temporal scales and annotation levels. Experiments demonstrate that the proposed approach improves both the flexibility and the stability of SED systems, particularly under label‐sparse or noisy conditions. Our work offers a scalable and generalizable way to leverage both weakly labeled and unlabeled data in critical acoustic event recognition scenarios.
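The Mean‐Teacher consistency idea mentioned above can be illustrated with a minimal sketch. It assumes the standard formulation, in which the teacher's weights are an exponential moving average (EMA) of the student's weights and a consistency term penalizes disagreement between their predictions. The function names, the decay value `alpha`, the use of mean squared error, and the toy numbers are all illustrative assumptions, not details taken from the article.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, List[float]],
               student: Dict[str, List[float]],
               alpha: float = 0.999) -> None:
    """Update teacher weights in place as an EMA of the student weights.

    teacher <- alpha * teacher + (1 - alpha) * student
    (alpha is an assumed decay value, not one reported in the article.)
    """
    for name, s_vals in student.items():
        t_vals = teacher[name]
        for i, s in enumerate(s_vals):
            t_vals[i] = alpha * t_vals[i] + (1.0 - alpha) * s


def consistency_loss(student_probs: List[float],
                     teacher_probs: List[float]) -> float:
    """Mean squared error between student and teacher predictions.

    MSE is an assumed choice of consistency measure for this sketch.
    """
    if not student_probs or len(student_probs) != len(teacher_probs):
        raise ValueError("prediction lists must be non-empty and equal in length")
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Toy example with hypothetical values (not data from the article).
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 0.0]}
ema_update(teacher, student, alpha=0.9)          # teacher moves 10% toward student
loss = consistency_loss([0.2, 0.8], [0.4, 0.6])  # per-frame event probabilities
```

In a full training loop, the consistency term would typically be added to the supervised loss on labeled clips, and `ema_update` would be called after each student optimizer step. How Gate‐Align‐SED weights or schedules these terms is not stated in the abstract.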
Chen et al. studied this question.