March 3, 2026 · Open Access

A real-time bird sound recognition app via deep learning techniques

Key Points

ResNet-18 achieved an overall accuracy of 0.955 across 610 bird species classes.
The model runs inference in 25.9 milliseconds with a 3.03-megabyte memory footprint, making it suitable for real-time use.
Training used class weighting, batch normalization, dropout, and data augmentation; the trained model was converted to TensorFlow Lite for mobile deployment.
By enabling offline, high-accuracy sound classification on mobile devices, the framework could support biodiversity monitoring and conservation research.
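
The key points mention class weighting without saying which scheme was used. As a minimal sketch, assuming the common inverse-frequency ("balanced") heuristic, weights could be derived from label counts as shown here. The function name `balanced_class_weights` and the toy species labels are illustrative and not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * count_c).

    Assumption: this is the "balanced" heuristic; the paper does not
    state which weighting formula it used.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical, deliberately imbalanced toy label set.
labels = ["bulbul"] * 6 + ["barbet"] * 3 + ["magpie"] * 1
weights = balanced_class_weights(labels)

for species, w in sorted(weights.items()):
    print(f"{species}: {w:.3f}")
```

Rare classes receive larger weights, so each misclassified recording of an under-represented species contributes more to the training loss than one from a common species.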

Abstract

This study presents a real-time, on-device bird sound recognition system developed using deep transfer learning and optimized for mobile deployment. A curated corpus covering 610 Taiwanese bird species, drawn from Xeno-canto (an open-access repository of wildlife sound recordings contributed by citizen scientists worldwide), was used to evaluate six deep learning architectures: Residual Network-18 (ResNet-18), Yet Another Mobile Network (YAMNet), Visual Geometry Group-like Network for Audio Classification (VGGish), Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM), Attention-based Convolutional Neural Network (Attention-CNN), and a Deep Neural Network (DNN) baseline. All models were trained using class weighting, batch normalization, a dropout rate of 0.2, and targeted data augmentation, including pitch shifting (±2 semitones), time stretching (0.8–1.2), and time shifting (16,000 samples). Among these, ResNet-18 achieved the best balance between accuracy and computational efficiency, with an overall accuracy of 0.955, macro-precision of 0.95, macro-recall of 0.94, and macro-F1 of 0.945 across all 610 classes. The model performs inference in 25.9 milliseconds with only 3.03 megabytes of memory (approximately 795,000 parameters), outperforming heavier architectures such as VGGish (0.8975 accuracy, 42.2 milliseconds, 587 megabytes) while remaining competitive with compact alternatives like YAMNet (0.935 accuracy, 27.0 milliseconds, 10.19 megabytes). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations indicate that predictions are driven by species-specific temporal–spectral patterns rather than background noise. Converting the optimized model to TensorFlow Lite enables fully offline inference on Android devices, eliminating cloud latency and keeping recordings on the device. Overall, this lightweight, high-accuracy framework offers a scalable and practical solution for real-time biodiversity monitoring and conservation research.
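
The abstract reports overall accuracy alongside macro-averaged precision, recall, and F1. The sketch below shows one standard way to compute such metrics from true and predicted labels, where each class counts equally regardless of how many recordings it has. It assumes macro-F1 is the unweighted mean of per-class F1 scores; the abstract does not state the exact definition used. The helper `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Compute accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro-F1 is the mean of per-class F1 values, with any
    undefined ratio (zero denominator) treated as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Hypothetical toy predictions over three classes.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "a"]
scores = macro_scores(y_true, y_pred)

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the single recording of class "c" is missed, which lowers the macro scores much more than the overall accuracy. That sensitivity to rare classes is why macro-averaged metrics are informative for an imbalanced corpus with hundreds of species.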

Authors

Hailemariam Abebe Endalamaw

C. C. Yang

Cheng-Hung Hsu

Journals

Multimedia Tools and Applications

Institutions

National Taiwan University of Science and Technology
