Acoustic scene classification (ASC) aims to recognize and characterize acoustic environments. Convolutional neural networks (CNNs) are widely used in ASC because they can learn local time-frequency information from spectrograms. High-performance CNNs typically require a large number of parameters, yet ASC systems are predominantly deployed on lightweight devices, where tight parameter budgets limit the accuracy such networks can reach. This paper presents a low-complexity multi-scale pyramid pooling (MSP) strategy for CNNs, applied to convolutional layers at different depths, to improve the performance of baseline CNNs under limited parameter budgets. Specifically, MSP analyzes the contribution of different sound events to specific scenes by capturing the correlation among local feature maps with varying time-frequency detail. Experimental results on multiple ASC datasets show that MSP modules considerably improve baseline CNNs: only 4.99k additional parameters yield performance improvements of 5.26% and 4.38% on the DCASE 2019 and DCASE 2020 datasets, respectively. These results indicate that the proposed MSP module can effectively improve resource-constrained ASC systems and has potential applications in real-world scenarios such as intelligent surveillance, smart wearable devices, and edge-based audio monitoring systems.
Jiang et al. studied this question.
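The abstract does not specify how the MSP module is built, so the following is only a minimal sketch of generic multi-scale pyramid pooling, not the authors' implementation. It assumes a CNN feature map of shape (channels, frequency, time) and average-pools it over grids of several sizes; the function name `pyramid_pool` and the grid sizes `(1, 2, 4)` are hypothetical choices made for illustration. The correlation analysis among local feature maps described above is not modelled here.

```python
import numpy as np


def pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Average-pool a (channels, freq, time) map over grids of several sizes.

    Hypothetical illustration of generic pyramid pooling; the grid sizes in
    `levels` are assumptions, not values taken from the paper.
    """
    channels, n_freq, n_time = feature_map.shape
    if max(levels) > min(n_freq, n_time):
        raise ValueError("feature map is too small for the largest grid")

    pooled = []
    for level in levels:
        # Integer bin edges splitting each axis into `level` nearly equal parts.
        f_edges = np.linspace(0, n_freq, level + 1).astype(int)
        t_edges = np.linspace(0, n_time, level + 1).astype(int)
        for i in range(level):
            for j in range(level):
                cell = feature_map[:,
                                   f_edges[i]:f_edges[i + 1],
                                   t_edges[j]:t_edges[j + 1]]
                # One average per channel for this time-frequency cell.
                pooled.append(cell.mean(axis=(1, 2)))

    # Shape: (channels, sum(level**2 for level in levels)).
    return np.stack(pooled, axis=1)


# Feature maps of different sizes yield descriptors of the same length.
shallow = pyramid_pool(np.random.rand(16, 32, 40))
deep = pyramid_pool(np.random.rand(16, 8, 10))
print(shallow.shape, deep.shape)
```

Because the pooling step itself has no trainable weights, a scheme of this kind adds parameters only in whatever layers consume the pooled descriptor. Coarse grids summarize the scene as a whole, while finer grids retain more localized time-frequency detail.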