Birds are important ecological indicators that respond sensitively to changes in the environment and its health. However, bird monitoring is challenging because of species diversity, variable behaviors, and distinct morphological characteristics. We therefore propose a parallel dual-branch hybrid CNN–Transformer architecture for feature extraction that captures local and global image features simultaneously, addressing the “local feature similarity” problem in the dual task of recognizing bird species and behaviors. The dual-task framework comprises three main components: the Token Re-segmentation Module (TRM), the Multi-scale Adaptive Module (MAM), and the Feature Interleaving Structure (FIS). MAM uses hybrid attention to handle birds that appear at different scales: it models the interdependencies between the spatial and channel dimensions of features from different scales, which lets the model adaptively select scale-specific feature representations for inputs of different scales. FIS is an efficient feature-sharing mechanism between the parallel CNN branches. It interleaves and fuses CNN feature maps across parallel layers and combines them with the features of the corresponding Transformer layer, sharing local and global information at different depths and promoting deep feature fusion across the parallel networks. Finally, TRM addresses the challenge of visually similar but distinct bird species and of similar poses that correspond to distinct behaviors. TRM takes a two-step approach: it first locates discriminative regions and then performs fine segmentation on them. This lets the network allocate relatively more attention to key areas while merging non-essential information and reducing interference from irrelevant details.
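The two-step idea behind TRM (locate discriminative regions, then segment them more finely while merging the rest) can be illustrated with a minimal NumPy sketch. This is not the module itself: the function name `resegment_tokens`, the externally supplied per-patch `scores`, the 2x2 split of selected patches, and the mean-based merge of the remaining patches are all assumptions made for illustration.

```python
import numpy as np

def resegment_tokens(image, scores, patch=4, top_k=2):
    """Hypothetical sketch: split the top_k highest-scoring patches into
    2x2 finer sub-patches and average all other patches into one token.
    Assumes `patch` is even and top_k is smaller than the patch count."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Cut the image into a (gh*gw, patch, patch, c) stack of coarse patches.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, patch, patch, c))
    # Step 1: locate discriminative regions (assumed: highest scores win).
    order = np.argsort(scores.ravel())[::-1]
    keep, rest = order[:top_k], order[top_k:]
    # Step 2: re-segment each selected patch into four half-size sub-patches.
    half = patch // 2
    fine = (patches[keep]
            .reshape(len(keep), 2, half, 2, half, c)
            .transpose(0, 1, 3, 2, 4, 5)
            .reshape(len(keep) * 4, half * half * c))
    # Merge the non-essential patches into a single averaged token.
    merged = patches[rest].mean(axis=0).reshape(1, -1)
    return fine, merged, keep

# Usage: an 8x8 RGB image gives a 2x2 grid of 4x4 patches.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
scores = np.array([[0.1, 0.9], [0.2, 0.8]])  # assumed importance per patch
fine, merged, keep = resegment_tokens(image, scores)
print(fine.shape, merged.shape)  # (8, 12) (1, 48)
```

In this sketch the two selected patches contribute eight fine tokens, while the two remaining patches collapse into one merged token, so most of the token budget goes to the regions judged discriminative. In a real network the scores would presumably come from the model (for example, attention weights), which the text above does not specify.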
Experiments on our self-built dataset show that the proposed network outperforms state-of-the-art classification networks, reaching 79.70% accuracy in bird species recognition and 76.21% in behavior recognition, along with the best performance in dual-task recognition.