Convolutional neural networks (CNNs) are widely used for environmental sound classification (ESC). However, 2-D convolutions assume translation invariance along both the time and frequency axes, whereas in practice the frequency dimension is not shift-invariant. In addition, single-scale convolutions limit the receptive field, which leads to incomplete feature representations. To address these issues, we introduce a parallel time-frequency multi-scale attention (PTFMSA) module that integrates local and global attention across multiple scales to improve dynamic convolution. The module uses a parallel branch structure so that time-domain and frequency-domain features are extracted without interfering with each other, and learnable parameters dynamically adjust the weights of the different branches during network training. Building on this module, we develop PTFMSAN, a compact network that processes raw waveforms directly for ESC. To further strengthen learning, we apply between-class (BC) training. Experiments on the ESC-50 dataset show that PTFMSAN outperforms the baseline model and achieves a classification accuracy of 90%, which is competitive among CNN-based networks. Ablation experiments verify the effectiveness of each module.
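As a rough illustration of the between-class (BC) training idea mentioned above, the following minimal sketch mixes two waveforms from different classes and builds a matching soft label. It is a simplified assumption, not the authors' procedure: the abstract does not give the mixing formula, and the function name `bc_mix`, its parameters, and the plain linear mix are hypothetical choices made for illustration. Published BC learning formulations may also compensate for loudness differences between the two clips, which is omitted here.

```python
import random


def bc_mix(wave_a, label_a, wave_b, label_b, num_classes, ratio=None, rng=random):
    """Mix two equal-length waveforms from different classes (simplified sketch).

    Returns the mixed waveform and a soft label whose entries are the
    mixing ratio for class A and its complement for class B.
    """
    if label_a == label_b:
        raise ValueError("between-class mixing expects two different classes")
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same length")

    # Draw the mixing ratio uniformly from [0, 1) unless one is supplied.
    r = rng.random() if ratio is None else ratio

    # Assumption: a plain linear mix of the raw samples.
    mixed = [r * a + (1.0 - r) * b for a, b in zip(wave_a, wave_b)]

    # Soft target: the network is trained to predict the mixing ratio.
    target = [0.0] * num_classes
    target[label_a] = r
    target[label_b] = 1.0 - r
    return mixed, target
```

With such soft targets, a model would typically be trained with a divergence loss (for example KL divergence) between its predicted class distribution and `target` rather than with a hard-label cross-entropy.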
Wan et al. (Fri,) studied this question.