This paper presents a methodology that employs inductive spatial geometric deep learning networks to detect multiple avian vocalizations in field recordings. First, a trained deep convolutional neural network (Deep CNN) extracts features from the Mel-spectrogram of each audio file. These features are used to build a node-feature graph, which is then processed by two spatial inductive graph-based models for multi-label classification: graph sample and aggregation (GraphSAGE) and the graph attention network (GAT). To improve the robustness and generalization of the Deep CNN, SpecAugment is applied to generate additional Mel-spectrograms for data augmentation. The proposed framework is evaluated on the Xeno-canto bird sound database and compared against state-of-the-art methods. The results show that the proposed inductive spatial graph-based approach outperforms existing techniques, achieving macro F1-scores of 0.90 with GraphSAGE and 0.92 with GAT. Replacing the Deep CNN with AudioProtoPNet-20 and evaluating GAT on the Xeno-canto dataset yields a macro F1-score of 0.93.
Noumida et al. (Wed,) studied this question.
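To make two ingredients of the abstract concrete, the following is a minimal NumPy sketch of a GraphSAGE-style layer with a mean aggregator and of the macro F1-score for multi-label predictions. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the graph construction, the aggregator, the layer sizes, or how labels without positives are scored, so the function names `sage_mean_layer` and `macro_f1`, the dense adjacency matrix, the ReLU activation, and the zero-score convention for empty labels are all hypothetical choices.

```python
import numpy as np


def sage_mean_layer(h, adj, w_self, w_neigh):
    """One GraphSAGE-style layer with a mean aggregator (illustrative only).

    h:       (N, F) node-feature matrix, e.g. CNN features per node.
    adj:     (N, N) binary adjacency matrix without self-loops (assumed dense).
    w_self:  (F, D) weights applied to each node's own features.
    w_neigh: (F, D) weights applied to the averaged neighbour features.
    """
    deg = adj.sum(axis=1, keepdims=True)
    # Average the neighbours' features; isolated nodes receive a zero vector.
    neigh_mean = (adj @ h) / np.maximum(deg, 1)
    out = h @ w_self + neigh_mean @ w_neigh
    return np.maximum(out, 0.0)  # ReLU activation (an assumption)


def macro_f1(y_true, y_pred):
    """Macro F1 for binary multi-label arrays of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label with no positives and no predictions scores 0.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())  # unweighted mean over labels


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 with two features per node.
    h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    print(sage_mean_layer(h, adj, np.eye(2), np.eye(2)))

    y_true = [[1, 0], [1, 1], [0, 1]]
    y_pred = [[1, 0], [0, 1], [0, 1]]
    print(macro_f1(y_true, y_pred))
```

Because the layer depends only on a node's own features and the mean of its neighbours' features, the same weights can be applied to nodes or graphs not seen during training, which is the sense in which such models are inductive. The macro average weights every label equally, so rare species count as much as common ones in the reported score.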