What question did this study set out to answer?

The research aims to improve fine-grained bird species recognition through an adaptive audiovisual fusion method that leverages prediction confidence.

May 22, 2026 | Open Access

An Adaptive Audiovisual Fusion Method Based on Prediction Confidence for Fine Granularity Bird Species Recognition

Key Points

Developed an adaptive audiovisual fusion framework with an image classification branch and an audio classification branch.
Used EfficientNet-B3 to extract fine-grained visual features and ResNet-50 to classify Mel spectrograms of bird vocalizations.
Implemented a confidence-adaptive fusion module to assign dynamic weights to each modality's prediction based on reliability.
The image branch achieved a Top-1 accuracy of 91.55%, outperforming ResNet-50 (89.75%) and VGG-16 (83.81%).
The audio branch reached a Top-1 accuracy of 68.20%, surpassing AST (63.29%) and VGG-16 (53.48%).
The fused model attained a Top-1 accuracy of 95.30%, improving by 3.75 percentage points over the image-only baseline.
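
The confidence-adaptive fusion described in the key points can be illustrated with a small sketch. The abstract states only that reliability is assessed from information entropy and probability gap, and that weights are assigned per sample without trainable parameters. Everything else here is an assumption made for illustration: the function names, the equal averaging of the two cues, and the normalization of the weights are hypothetical and may differ from the paper's actual formulation.

```python
import math


def entropy_confidence(probs):
    """1 minus normalized Shannon entropy: near 1 for a peaked distribution, 0 for uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - h / math.log(len(probs)))


def gap_confidence(probs):
    """Difference between the largest and second-largest class probability."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def reliability(probs):
    # ASSUMPTION: the two cues are simply averaged; the source does not say
    # how entropy and probability gap are combined.
    return 0.5 * (entropy_confidence(probs) + gap_confidence(probs))


def fuse(image_probs, audio_probs, eps=1e-8):
    """Per-sample weighted average of two softmax outputs (no trainable parameters)."""
    r_img = reliability(image_probs)
    r_aud = reliability(audio_probs)
    # ASSUMPTION: weights are the reliabilities normalized to sum to 1.
    w_img = (r_img + eps) / (r_img + r_aud + 2 * eps)
    w_aud = 1.0 - w_img
    fused = [w_img * pi + w_aud * pa for pi, pa in zip(image_probs, audio_probs)]
    return fused, w_img, w_aud


# Toy sample: the image branch is unsure, the audio branch is confident in class 1.
image_probs = [0.40, 0.35, 0.25]
audio_probs = [0.05, 0.90, 0.05]
fused, w_img, w_aud = fuse(image_probs, audio_probs)
predicted_class = max(range(len(fused)), key=lambda i: fused[i])
```

In this toy case the audio branch receives the larger weight and the fused prediction follows it, which is the behavior a sample-level weighting scheme is meant to produce when one modality is unreliable.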

Abstract

To address the inherent limitations of single-modality approaches in fine-grained bird species recognition, this paper proposes an adaptive audiovisual fusion method based on prediction confidence. The proposed framework comprises three core components: an image classification branch, an audio classification branch, and a confidence–adaptive fusion module. The image branch employs EfficientNet-B3 to extract fine-grained visual features through compound scaling and squeeze-and-excitation (SE) attention. The audio branch utilizes ResNet-50 to classify Mel spectrograms converted from bird vocalizations, incorporating a dense sampling inference strategy to fully exploit complete audio information. For multimodal integration, a confidence–adaptive fusion strategy is introduced that jointly considers information entropy and probability gap to dynamically assess the reliability of each modality’s prediction, thereby assigning fusion weights at the sample level without any additional trainable parameters. Experiments on the SSW60 multimodal bird recognition dataset show that the image branch achieves a Top-1 accuracy of 91.55%, outperforming ResNet-50 (89.75%) and VGG-16 (83.81%); the audio branch reaches 68.20%, surpassing AST (63.29%) and VGG-16 (53.48%); and the fused model attains 95.30% Top-1 accuracy, a 3.75 percentage-point improvement over the image-only baseline and a 0.21 percentage-point gain over the learning-based TMC fusion baseline without introducing any trainable parameters, confirming the effectiveness of the proposed method.
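
The dense sampling inference strategy mentioned in the abstract is not specified in detail here. One plausible reading, sketched below purely as an assumption, is to slide overlapping windows across the whole Mel spectrogram, classify each window, and average the resulting class probabilities. The names `dense_windows`, `dense_predict`, and `classify`, and the window and hop sizes, are hypothetical.

```python
def dense_windows(num_frames, window, hop):
    """Return (start, end) frame spans of overlapping windows covering the whole clip."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, hop))
    if starts[-1] != num_frames - window:
        # Add a final window aligned to the end so the tail is not dropped.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def dense_predict(frames, classify, window, hop):
    """Average class probabilities over every window of the clip.

    `classify` is a stand-in for the audio model: it takes a list of frames
    and returns a list of class probabilities.
    """
    spans = dense_windows(len(frames), window, hop)
    totals = None
    for start, end in spans:
        probs = classify(frames[start:end])
        if totals is None:
            totals = [0.0] * len(probs)
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / len(spans) for t in totals]


# Toy stand-in classifier: class 0 if the window contains any energy, else class 1.
def toy_classify(window_frames):
    return [1.0, 0.0] if sum(window_frames) > 0 else [0.0, 1.0]


clip = [0] * 6 + [1] * 4  # ten frames, with the call only in the last four
clip_probs = dense_predict(clip, toy_classify, window=4, hop=3)
```

Compared with classifying a single crop, averaging over densely spaced windows lets a short vocalization anywhere in the recording contribute to the clip-level prediction.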
