What question did this study set out to answer?

The research aims to create a deep learning model for precise classification of ear diseases using otoscopic images.

March 30, 2026 | Open Access

BioOtoFusionNet – A bio-attention & frequency-aware hybrid network for ear disease classification

Key Points

The research aims to create a deep learning model for precise classification of ear diseases using otoscopic images.
Developed a dual-branch architecture with a Frequency-Aware Stream and a Shape-Aware Stream.
Used Discrete Wavelet Transform (DWT) sub-bands to capture frequency features.
Fused the two representations with an Adaptive Cross-Fusion Module (ACFM) and Multi-Scale Attention Pooling (MSAP).
Evaluated with stratified five-fold cross-validation, keeping training and evaluation samples strictly separate.
Achieved an overall accuracy of 96.8% and an F1-score of 95.6%.
Achieved an AUC-ROC of 98.4%, outperforming all ablated variants.
Reached class-wise accuracy of 97.8% for Earwax Plug and 97.1% for Normal Ear.
Showed resilience to illumination variations (RII = 0.15) and low diagnostic ambiguity (DCS = 0.31).
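
To make the idea of DWT sub-bands concrete, the Python sketch below computes a single-level 2D Haar transform of a small grayscale patch. It is not the authors' implementation. The Haar wavelet, the single decomposition level, and the names haar_dwt2, approx, horiz, vert, and diag are assumptions chosen for illustration; the source does not state which wavelet or how many levels BioOtoFusionNet uses.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Illustrative sketch only (Haar wavelet is an assumption, not the
    paper's stated choice). Returns four half-resolution sub-bands:
    approx (low-pass), horiz (row differences), vert (column
    differences), and diag (diagonal differences). Sign and naming
    conventions for the detail bands vary between libraries.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")

    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block of pixels that maps to one output coefficient.
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            # Orthonormal Haar: two 1/sqrt(2) filter passes give a factor 1/2.
            a_row.append((tl + tr + bl + br) / 2.0)
            h_row.append((tl + tr - bl - br) / 2.0)  # top minus bottom
            v_row.append((tl - tr + bl - br) / 2.0)  # left minus right
            d_row.append((tl - tr - bl + br) / 2.0)  # diagonal contrast
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


if __name__ == "__main__":
    print(haar_dwt2([[1, 2], [3, 4]]))
    # prints ([[5.0]], [[-2.0]], [[-1.0]], [[0.0]])
```

The approximation band holds the smooth content, while the three detail bands respond to fine texture and edges. A frequency-aware stream could plausibly take bands like these as input channels, though the exact wiring in BioOtoFusionNet is not described here.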

Abstract

Automated interpretation of otoscopic images is challenging due to subtle textural variations, anatomical complexity, and inconsistent acquisition conditions. This study aims to develop an accurate and interpretable deep learning framework for ear disease classification. This work presents BioOtoFusionNet, a clinically motivated dual-branch architecture integrating a Frequency-Aware Stream based on Discrete Wavelet Transform (DWT) sub-bands and a Shape-Aware Stream that captures anatomical structures using edge-based features and capsule attention. An Adaptive Cross-Fusion Module (ACFM) and Multi-Scale Attention Pooling (MSAP) are employed to effectively fuse complementary representations across spatial resolutions. BioOtoFusionNet achieved an overall accuracy of 96.8%, an F1-score of 95.6%, and an AUC-ROC of 98.4%, outperforming all ablated variants. High class-wise accuracy was observed for Earwax Plug (97.8%), Myringosclerosis (94.5%), Chronic Otitis Media (93.2%), and Normal Ear (97.1%). Clinically motivated interpretability metrics demonstrated balanced reliance on texture and structure (FHI = 0.60, SAR = 1.25), strong attention localization (ALS = 0.72), and stable multi-scale behaviour (MSAC = 0.87). Robustness analysis showed resilience to illumination variations (RII = 0.15) and low diagnostic ambiguity (DCS = 0.31). Evaluation was conducted on a four-class otoscopic dataset using stratified five-fold cross-validation with strict separation between training and evaluation samples. Data augmentation was applied only to training subsets to prevent information leakage. BioOtoFusionNet provides accurate, interpretable, and robust ear disease classification from otoscopic images, highlighting its potential for clinical decision support and telemedicine-based otologic screening.
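
The evaluation protocol described above (stratified five-fold cross-validation, with augmentation restricted to the training subsets) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names stratified_folds, augment, and cross_validation_splits are hypothetical, and augment is a placeholder that only tags a sample rather than transforming an image.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def augment(sample):
    """Hypothetical placeholder: a real pipeline would flip, rotate, etc."""
    return (sample, "augmented")


def cross_validation_splits(samples, labels, k=5, seed=0):
    """Yield (train, evaluation) pairs; only training data is augmented."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        eval_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        train = [(samples[i], labels[i]) for i in train_idx]
        # Augmented copies are derived from training samples only, so no
        # variant of an evaluation image can leak into training.
        train += [(augment(samples[i]), labels[i]) for i in train_idx]
        evaluation = [(samples[i], labels[i]) for i in eval_idx]
        yield train, evaluation
```

Splitting first and augmenting afterwards is the key design choice: if augmentation ran before the split, near-duplicates of an evaluation image could end up in the training fold and inflate the reported scores.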
