What does this research mean for the field?

The proposed multi-grained detail-enhanced and patch-aware network significantly improves the accuracy of bird sound recognition, achieving accuracies of 96.29%, 86.51%, and 97.40% on three datasets. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to improve bird sound recognition by enhancing feature extraction despite environmental noise.

March 5, 2026 | Open Access

Multi-grained detail-enhanced and patch-aware network based on bird sound recognition

Key Points

The study aims to improve bird sound recognition by enhancing feature extraction despite environmental noise.
Utilized a densely connected time delay neural network as the backbone.
Proposed multi-grained detail-enhanced convolution combining various convolution techniques.
Implemented multi-grained pooling strategies to capture fine-grained acoustic features.
Introduced branch patch-aware attention module for local and global information capture.
Achieved recognition accuracies of 96.29%, 86.51%, and 97.40% on three datasets.
Demonstrated improved capability in capturing audio feature information.
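
The patch partitioning behind the branch patch-aware attention module can be pictured with a small sketch. It is only an illustration under stated assumptions: this summary does not give the patch sizes, the number of branches, or the attention computation, so the function names `partition_patches` and `patch_means`, the sizes 2 and 4, and the use of a per-patch mean as a stand-in descriptor are all hypothetical. The attention step itself is omitted.

```python
def partition_patches(feature_map, patch_size):
    """Split a 2-D map into non-overlapping patch_size x patch_size blocks.

    Patches are returned in row-major order. The map is a list of rows,
    e.g. a small time-frequency feature map.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    if rows % patch_size or cols % patch_size:
        raise ValueError("map dimensions must be divisible by patch_size")
    patches = []
    for r in range(0, rows, patch_size):
        for c in range(0, cols, patch_size):
            block = [row[c:c + patch_size] for row in feature_map[r:r + patch_size]]
            patches.append(block)
    return patches


def patch_means(feature_map, patch_size):
    """One scalar per patch: a placeholder for whatever a real branch computes."""
    area = patch_size * patch_size
    return [
        sum(sum(row) for row in patch) / area
        for patch in partition_patches(feature_map, patch_size)
    ]


# A 4x4 toy feature map holding the values 0..15.
toy_map = [[4 * r + c for c in range(4)] for r in range(4)]

# Hypothetical branches: small patches keep local detail,
# one large patch summarises the whole map.
local_branch = patch_means(toy_map, 2)
global_branch = patch_means(toy_map, 4)
```

Smaller patches yield more descriptors, each covering a narrower time-frequency region, while a single large patch collapses the map into one global summary. A multi-branch design would combine both views.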

Abstract

Combining deep learning with bird sound recognition strongly supports monitoring bird species and maintaining ecological balance. However, in outdoor environments, environmental noise often hinders the extraction of bird sound features, making it difficult for models to fully learn the fine-grained features of bird sounds. Moreover, single-scale feature extraction struggles to cover the time–frequency feature information of bird sounds across multiple dimensions. To address these issues, this paper proposes a multi-grained detail-enhanced and patch-aware network. The model uses a densely connected time delay neural network as its backbone and introduces multi-grained detail-enhanced convolution, which combines vanilla convolutions with differential convolutions at the horizontal, vertical, angular, and central levels, and incorporates multi-grained pooling strategies to learn fine-grained acoustic features at different levels. To further overcome the limitations of single-scale feature extraction, a branch patch-aware attention module is proposed. This module jointly captures local details and global contextual information through a multi-branch structure and patch partitioning at different sizes. On the three datasets, the method achieved accuracies of 96.29%, 86.51%, and 97.40%, respectively. These results demonstrate the method's ability to capture and parse audio feature information precisely.
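
To make the idea of pairing a vanilla convolution with a differential convolution concrete, the following is a minimal sketch of the central variant only. It assumes the common central-difference formulation, in which every value in the sliding window is replaced by its difference from the window centre before weighting. The abstract does not give the paper's exact definition, its horizontal, vertical, and angular variants, or how the branches are fused, so the function names and the simple additive fusion below are illustrative assumptions rather than the authors' implementation.

```python
def conv2d_valid(x, w):
    """Vanilla 2-D cross-correlation, stride 1, no padding."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * w[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def central_difference_conv2d(x, w):
    """Same window, but each value is taken relative to the window centre.

    Flat regions therefore produce zero, while local changes such as
    onsets or harmonic edges in a spectrogram produce a response.
    """
    kh, kw = len(w), len(w[0])
    ch, cw = kh // 2, kw // 2
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(
                (x[i + a][j + b] - x[i + ch][j + cw]) * w[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def detail_enhanced_conv2d(x, w_vanilla, w_diff):
    """Assumed fusion: element-wise sum of the vanilla and difference branches."""
    vanilla = conv2d_valid(x, w_vanilla)
    diff = central_difference_conv2d(x, w_diff)
    return [[v + d for v, d in zip(rv, rd)] for rv, rd in zip(vanilla, diff)]


# Toy inputs: a constant map, and a map with a vertical edge between columns 1 and 2.
flat_map = [[3.0] * 5 for _ in range(5)]
edge_map = [[0, 0, 1, 1] for _ in range(3)]
ones_kernel = [[1, 1, 1] for _ in range(3)]

flat_response = central_difference_conv2d(flat_map, ones_kernel)
edge_response = central_difference_conv2d(edge_map, ones_kernel)
```

On the constant map the difference branch returns all zeros, whereas on the edge map it responds with opposite signs on either side of the edge. The vanilla branch preserves overall intensity and the difference branch emphasises fine local detail.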
