Combining deep learning with bird sound recognition supports the monitoring of bird species and the maintenance of ecological balance. However, in outdoor environments, environmental noise often hinders the extraction of bird sound features, making it difficult for models to fully learn the fine-grained features of bird sounds. In addition, single-scale feature extraction struggles to cover the time–frequency feature information of bird sounds across multiple dimensions. To address these issues, this paper proposes a multi-grained detail-enhanced and patch-aware network. The model uses a densely connected time delay neural network as its backbone and introduces a multi-grained detail-enhanced convolution, which combines vanilla convolutions with differential convolutions at the horizontal, vertical, angular, and central levels, and incorporates multi-grained pooling strategies to learn fine-grained acoustic features at different levels. To further overcome the limitations of single-scale feature extraction, a branch patch-aware attention module is proposed. This module jointly captures local details and global contextual information through a multi-branch structure and patch partitioning at different sizes. On the three datasets, the method achieved accuracies of 96.29%, 86.51%, and 97.40%, respectively. These results indicate that the method captures and parses audio feature information precisely.
Duan et al. (Mon,) studied this question.
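
The idea of pairing a vanilla convolution with a differential convolution can be illustrated with a minimal NumPy sketch. It covers only the central-difference case on a single-channel spectrogram; the function names (`conv2d_valid`, `central_difference_conv2d`, `detail_enhanced_conv2d`), the summation of the two branches, and the kernel shapes are assumptions made for illustration and are not taken from the paper, which also uses horizontal, vertical, and angular differences together with multi-grained pooling.

```python
import numpy as np


def conv2d_valid(x, w):
    """Plain 2-D cross-correlation with 'valid' padding (no kernel flip)."""
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out


def central_difference_conv2d(x, w):
    """Convolve the differences between each neighbour and the window centre.

    sum_n w_n * (x_n - x_c) equals the vanilla response minus x_c * sum(w),
    so the result is zero wherever the input is locally constant.
    Assumes an odd-sized kernel so that the window has a unique centre.
    """
    kh, kw = w.shape
    ch, cw = kh // 2, kw // 2
    # Centre pixel of every 'valid' window, aligned with the output grid.
    centre = x[ch:x.shape[0] - (kh - 1 - ch), cw:x.shape[1] - (kw - 1 - cw)]
    return conv2d_valid(x, w) - centre * w.sum()


def detail_enhanced_conv2d(x, w_vanilla, w_diff):
    """Hypothetical fusion: sum of a vanilla branch and a central-difference branch."""
    return conv2d_valid(x, w_vanilla) + central_difference_conv2d(x, w_diff)


# Usage on a random stand-in for a (frequency x time) spectrogram patch.
rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((8, 10))
w_v = rng.standard_normal((3, 3))
w_d = rng.standard_normal((3, 3))
features = detail_enhanced_conv2d(spectrogram, w_v, w_d)
```

The vanilla branch responds to absolute intensity, while the difference branch responds only to local changes, which is one plausible reading of why such a combination could help emphasise fine-grained detail over a steady noise floor.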